Abstract We study distributed optimization algorithms for minimizing the average of convex functions. 
The applications include empirical risk minimization problems in statistical machine learning where the 
datasets are large and have to be stored on different machines. We design a distributed stochastic variance 
reduced gradient algorithm that, under certain conditions on the condition number, simultaneously 
achieves the optimal parallel runtime, amount of communication and rounds of communication among 
all distributed first-order methods up to constant factors. Our method and its accelerated extension 
also outperform existing distributed algorithms in terms of the rounds of communication as long as the 
condition number is not too large compared to the size of data in each machine. We also prove a lower 
bound for the number of rounds of communication for a broad class of distributed first-order methods 
including the proposed algorithms in this paper. We show that our accelerated distributed stochastic 
variance reduced gradient algorithm achieves this lower bound so that it uses the fewest rounds of 
communication among all distributed first-order algorithms. 
1 Introduction 

In this paper, we consider the distributed optimization problem of minimizing the average of $N$ convex
functions in $\mathbb{R}^d$, i.e.,

$$\min_{x \in \mathbb{R}^d} \Big\{ f(x) := \frac{1}{N} \sum_{i=1}^{N} f_i(x) \Big\} \qquad (1)$$

using $m$ machines. For simplicity, we assume $N = mn$ for an integer $n$ with $m \ll n$, but all of our results
can be easily generalized to a general $N$. Here, $f_i : \mathbb{R}^d \to \mathbb{R}$ for $i = 1, \dots, N$ is convex and $L$-smooth,
meaning that $f_i$ is differentiable and its gradient $\nabla f_i$ is $L$-Lipschitz continuous¹, i.e.,
$\|\nabla f_i(x) - \nabla f_i(y)\| \le L\|x - y\|$ for all $x, y \in \mathbb{R}^d$, and their average $f$ is $\mu$-strongly convex, i.e.,
$\|\nabla f(x) - \nabla f(y)\| \ge \mu\|x - y\|$ for all $x, y \in \mathbb{R}^d$.
We call $\kappa = L/\mu$ the condition number of the function $f$. Note that the function $f$ itself can be $L_f$-smooth,
namely, $\|\nabla f(x) - \nabla f(y)\| \le L_f\|x - y\|$ for all $x, y \in \mathbb{R}^d$, for a constant $L_f \le L$. Let $x^*$ be the unique optimal
solution of (1). A solution $\hat{x}$ is called an $\epsilon$-optimal solution² for (1) if $f(\hat{x}) - f(x^*) \le \epsilon$.

¹ In this paper, the norm $\|\cdot\|$ represents the Euclidean norm.

One of the most important applications of problem (1) is empirical risk minimization (ERM) in
statistics and machine learning. Suppose there exists a set of i.i.d. samples $\{\xi_1, \xi_2, \dots, \xi_N\}$ from an
unknown distribution $D$ of a random vector $\xi$. An ERM problem can be formulated as

$$\min_{x \in \mathbb{R}^d} \frac{1}{N} \sum_{i=1}^{N} \phi(x, \xi_i) \qquad (2)$$

where $x$ represents a group of parameters of a predictive model, $\xi_i$ is the $i$th data point, and $\phi(x, \xi)$
is a loss function. Note that (2) has the form of (1) with each function $f_i(x)$ being $\phi(x, \xi_i)$. Typically,
the data point $\xi$ is given as a pair $(a, b)$ where $a \in \mathbb{R}^d$ is a feature vector and $b \in \mathbb{R}$ is either a
continuous response (in regression problems) or a discrete response (in classification problems). Examples
of the loss function $\phi(x, \xi)$ with $\xi = (a, b)$ include: the square loss in linear regression, where $a \in \mathbb{R}^d$, $b \in \mathbb{R}$,
and $\phi(x, \xi) = (a^T x - b)^2$; the logistic loss in logistic regression, where $a \in \mathbb{R}^d$, $b \in \{1, -1\}$, and
$\phi(x, \xi) = \log(1 + \exp(-b(a^T x)))$; and the smooth hinge loss, where $a \in \mathbb{R}^d$, $b \in \{1, -1\}$, and

$$\phi(x, \xi) = \begin{cases} 0 & \text{if } b a^T x \ge 1 \\ \frac{1}{2} - b a^T x & \text{if } b a^T x \le 0 \\ \frac{1}{2}(1 - b a^T x)^2 & \text{otherwise.} \end{cases}$$
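As a small illustration of the three loss functions listed above, the following Python sketch evaluates each of them for a single data point $(a, b)$. The function names are hypothetical, and the smooth hinge is written in its standard form (zero when $b a^T x \ge 1$, $\frac{1}{2} - b a^T x$ when $b a^T x \le 0$, quadratic in between), which is an assumption about the intended definition.

```python
import math


def dot(a, x):
    """Inner product a^T x of two equal-length sequences."""
    return sum(ai * xi for ai, xi in zip(a, x))


def square_loss(x, a, b):
    """Square loss (a^T x - b)^2 for linear regression."""
    return (dot(a, x) - b) ** 2


def logistic_loss(x, a, b):
    """Logistic loss log(1 + exp(-b a^T x)) for b in {1, -1}."""
    z = -b * dot(a, x)
    # Numerically stable evaluation of log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def smooth_hinge_loss(x, a, b):
    """Smooth hinge loss for b in {1, -1} (standard form, assumed here)."""
    z = b * dot(a, x)
    if z >= 1:
        return 0.0
    if z <= 0:
        return 0.5 - z
    return 0.5 * (1 - z) ** 2
```

Each function plays the role of $\phi(x, \xi)$ with $\xi = (a, b)$, so averaging any one of them over the data gives an instance of problem (2).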

To improve the statistical generalization properties of the model learned from (2), a regularization term
$\frac{\lambda}{2}\|x\|^2$ is often added to (2) and the problem becomes a regularized ERM problem

$$\min_{x \in \mathbb{R}^d} \frac{1}{N} \sum_{i=1}^{N} \phi(x, \xi_i) + \frac{\lambda}{2}\|x\|^2 \qquad (3)$$

which still takes the form of (1) with $f_i(x) = \phi(x, \xi_i) + \frac{\lambda}{2}\|x\|^2$. The parameter $\lambda$ is called a regularization
parameter. As argued in the literature for ERM problems, the value of $\lambda$ is typically of order
$\Theta(1/\sqrt{N}) = \Theta(1/\sqrt{mn})$.

We consider a situation where all $N$ functions are initially stored in the same large storage space
that has limited computation power. We assume that each of the $m$ machines we use to solve (1) has a
limited memory space of $C$, so that it can load at most $C$ of the $N$ functions in (1). In the case of ERM,
this means each machine can load at most $C$ data points among $\{\xi_1, \xi_2, \dots, \xi_N\}$ in its memory. Since
the data point $\xi_i$ uniquely defines $f_i$ in ERM, in the rest of the paper we will call $f_i$ a data point $i$ or a
function $i$ interchangeably.

Throughout the paper, we make the following assumption.

Assumption 1 The memory space $C$ of each machine satisfies $n < C < N$ and the quantity $\tilde{n} = C - n$
satisfies $\tilde{n} \ge cn$ for a universal constant $c > 0$.

The inequality $C < N$ forces us to use more than one, if not all, of the machines for solving (1). The
quantity $\tilde{n}$ represents the remaining space in each machine after we evenly allocate the $N$ data points onto
the $m$ machines. The inequality $\tilde{n} \ge cn$ means each machine still has $\Omega(n)$ memory space after such an
allocation of data. This can happen when either the machine capacity $C$ or the number of machines $m$
is large enough.

We also assume that we can load the same function onto multiple machines so that different machines
may share some functions. Hence, the sets of functions in all machines do not necessarily form a partition of
$\{f_i\}_{i \in [N]}$. Since no machine can access all $N$ functions, we have to solve (1) by distributed algorithms
that alternate between a local computation procedure at each machine and a round of communication
to synchronize and share information among the machines.

² If $\hat{x}$ is a random variable generated by a stochastic algorithm, we call it an $\epsilon$-optimal solution if $E[f(\hat{x}) - f(x^*)] \le \epsilon$.
1.1 Communication efficiency and runtime 

To facilitate the theoretical study, we use the following simplified message passing model from the
distributed computation literature [9115]: we assume the communication occurs in rounds. In each
round, (a subset of) the machines exchange messages and, between two rounds, the machines only compute
based on their local information (local data points and messages received before).

Given this state of affairs, we study the distributed optimization problem with three performance 
metrics in mind. 

- Local parallel runtime: The longest running time of the $m$ machines spent in local computation,
measured in the number of gradient computations, i.e., computing $\nabla f_i(x)$ for any $i$. We also refer to it
as “runtime” for simplicity.

- The amount of communication: The total amount of communication among the $m$ machines and the
center, measured by the number of vectors³ of size $d$ transmitted.

- Rounds of communication: How many times all machines have to pause their local computation
and exchange messages. We also refer to it as “rounds” for simplicity.

We will study these performance metrics for the algorithms we propose and compare with other existing 
techniques. However, the main focus of this paper is the rounds of communication. 


1.2 Summary of contributions 

In this paper, we first propose a distributed stochastic variance reduced gradient (DSVRG) method, which
is simple and easy to implement: it is essentially a distributed implementation of the well-known
single-machine stochastic variance reduced gradient (SVRG) method [11, 26, 12]. We show that the proposed
DSVRG algorithm requires $O((1 + \frac{\kappa}{n}) \log(1/\epsilon))$ rounds of communication to find an $\epsilon$-optimal solution
for (1) under Assumption 1. The corresponding parallel runtime is $O((n + \kappa) \log(1/\epsilon))$ and the associated
amount of communication is $O((m + \frac{\kappa}{n}) \log(1/\epsilon))$.

Given these performance metrics of DSVRG, we further ask a key question:

How can we achieve the optimal parallel runtime, the optimal amount of communication, and the
optimal number of rounds of communication simultaneously for solving (1)?

This paper answers this seemingly ambitious question affirmatively in a reasonable situation: when
$\kappa = \Theta(n^{1-2\delta})$ with a constant $0 < \delta < \frac{1}{2}$, with appropriate choices for the parameters in DSVRG
(shown in Corollary 1), DSVRG finds an $\epsilon$-optimal solution for (1) with a parallel runtime of $O(n)$,
an $O(m)$ amount of communication and $O(1)$ rounds of communication for any $\epsilon = \frac{1}{n^s}$, where $s$ is any
positive constant. Here, the notation $O$ hides a logarithmic term of the optimality gap of an initial
solution for DSVRG, which is considered a constant throughout the paper.

We want to point out that $\kappa = \Theta(n^{1-2\delta})$ is a typical setting for machine learning applications.
For example, as argued in the literature for ERM, the condition number $\kappa$ is typically of order
$\Theta(\sqrt{N}) = \Theta(\sqrt{mn})$. Therefore, when the number of machines $m$ is not too large, e.g., when $m \le n^{0.8}$,⁴
we have that $\kappa = \Theta(\sqrt{mn}) \le \Theta(n^{0.9})$ (so that $\delta = 0.05$). Moreover, $\epsilon = n^{-10}$ (so that $s = 10$) is certainly a
high enough accuracy for most machine learning applications since it exceeds the machine precision of
real numbers, and typically people choose $\epsilon = \Theta(\frac{1}{N})$ in empirical risk minimization.

These performance guarantees of DSVRG, under the specific setting where $\kappa = \Theta(n^{1-2\delta})$ and
$\epsilon = O(\frac{1}{n^s})$, are optimal up to constant factors among all distributed first-order methods. First, to solve (1),
all $m$ machines together need to compute at least $\Omega(N)$ gradients [1] in total so that each function in
$\{f_i\}_{i \in [N]}$ can be accessed at least once. Therefore, at least one machine needs to compute at least $\Omega(n)$
gradients in parallel given any possible allocation of functions. Second, the amount of communication
is at least $\Omega(m)$ even for simple Gaussian mean estimation problems [3], which is a special case of
(1). Third, at least $O(1)$ rounds of communication are needed to integrate the computation results from
the machines into a final output.

Furthermore, using the generic acceleration techniques developed in [8] and [14], we propose a
distributed accelerated stochastic variance reduced gradient (DASVRG) method that further improves
the theoretical performance of DSVRG. Under Assumption 1, we show that DASVRG requires only
$\tilde{O}((1 + \sqrt{\kappa/n}) \log(1/\epsilon))$ rounds of communication to find an $\epsilon$-optimal solution, leading to better
theoretical performance than DSVRG. Also, we show that the runtime and the amount of communication
for DASVRG are $\tilde{O}((n + \sqrt{n\kappa}) \log(1/\epsilon))$ and $\tilde{O}((m + m\sqrt{\kappa/n}) \log(1/\epsilon))$, respectively. We also prove a
lower bound on the rounds of communication which shows that any first-order distributed algorithm needs
$\tilde{\Omega}(\sqrt{\kappa/n} \log(1/\epsilon))$ rounds of communication. This means DASVRG is optimal in that it uses the fewest
rounds of communication. Since our lower bound applies to a broad class of
distributed first-order algorithms, it is interesting in its own right. Here, and in the rest of the paper, $\tilde{O}$ and $\tilde{\Omega}$
hide some logarithmic terms of $\kappa$, $N$, $m$ and $n$.

³ We will only consider communicating data points, or iterates $x$. For simplicity, we assume that the data points and
iterates are of the same dimension, but this can be easily generalized.

⁴ If $n = 10^5$, then $n^{0.8} = 10^4$, which is already much more than the number of machines in most clusters.

Algorithm | Rounds | Parallel Runtime | Assumptions
DSVRG | $(1 + \frac{\kappa}{n}) \log\frac{1}{\epsilon}$ | $G(n + \kappa) \log\frac{1}{\epsilon}$ | Assumption 1
DASVRG | $(1 + \sqrt{\kappa/n}) \log(1 + \frac{\kappa}{n}) \log\frac{1}{\epsilon}$ | $G(n + \sqrt{n\kappa}) \log(1 + \frac{\kappa}{n}) \log\frac{1}{\epsilon}$ | Assumption 1
DISCO (quad) | $(1 + \frac{\sqrt{\kappa}}{n^{.25}}) \log\frac{1}{\epsilon}$ | $Q(1 + \frac{\sqrt{\kappa}}{n^{.25}}) \log\frac{1}{\epsilon}$ | (3), $\xi_i \overset{iid}{\sim} D$
DISCO (non-quad) | $d^{.25}((1 + \frac{\sqrt{\kappa}}{n^{.25}}) \log\frac{1}{\epsilon} + \dots)$ | $Q d^{.25}((1 + \frac{\sqrt{\kappa}}{n^{.25}}) \log\frac{1}{\epsilon} + \dots)$ | (3), $\xi_i \overset{iid}{\sim} D$
DANE (quad) | $(1 + \frac{\kappa^2}{n}) \log\frac{1}{\epsilon}$ | $Q(1 + \frac{\kappa^2}{n}) \log\frac{1}{\epsilon}$ | (3), $\xi_i \overset{iid}{\sim} D$
CoCoA+ | $\kappa \log\frac{1}{\epsilon}$ | $G(n + \sqrt{\kappa n}) \kappa \log\frac{1}{\epsilon}$ | (3)
Accel Grad | $\sqrt{\kappa_f} \log\frac{1}{\epsilon}$ | $G n \sqrt{\kappa_f} \log\frac{1}{\epsilon}$ |

Table 1 Rounds and runtime of different distributed optimization algorithms in a general setting. Let $G$ be
the computation cost of the gradient of $f_i$. For typical problems such as logistic regression, we have $G = O(d)$. Let $Q$ be
the cost of solving a linear system. We have $Q = O(d^3)$ if exact matrix inversion is used and $Q = O(dn\sqrt{\kappa} \log\frac{1}{\epsilon})$ if
an $\epsilon$-approximate inverse is found by an accelerated gradient method.

The rest of this paper is organized as follows. In Section 2, we compare the theoretical performance of
our methods with some existing work in distributed optimization. In Section 3 and Section 4, we propose
our DSVRG and DASVRG algorithms, respectively, and discuss their theoretical guarantees. In Section 5,
we prove a lower bound on the number of rounds of communication that a distributed algorithm
needs, which demonstrates that DASVRG is optimal. Finally, we present the numerical experiments in
Section 6 and conclude the paper in Section 7.


2 Related Work 

Recently, there have been several distributed optimization algorithms proposed for problem (1). We list
several of them, including a distributed implementation of the accelerated gradient method (Accel Grad)⁵
by Nesterov, in Table 1 and present their rounds and runtime for a clear comparison. The algorithms
proposed in this paper are DSVRG and DASVRG.

The distributed dual coordinate ascent methods, including DisDCA [37], CoCoA [10] and CoCoA+ [TS],
are a class of distributed coordinate optimization algorithms which can be applied to the conjugate dual
formulation of (3). In these methods, each machine only updates the $n$ dual variables contained in a local
problem defined on the $n$ local data points. Any optimization algorithm can be used as a subroutine in
each machine as long as it reduces the optimality gap of the local problem by a constant factor. According
to [TSlfTU], CoCoA+ requires $O(\kappa \log(1/\epsilon))$ rounds of communication to find an $\epsilon$-optimal solution⁶. If
the accelerated SDCA method is used as the subroutine in each machine, the total runtime for
CoCoA+ is $O((n + \sqrt{\kappa n}) \kappa \log(1/\epsilon))$. Therefore, both DSVRG and DASVRG have lower runtime and
communication than CoCoA+ and the other distributed dual coordinate ascent variants.

Assuming the problem (1) has the form of (3) with the $\xi_i$'s i.i.d. sampled from a distribution $D$ (denoted by
$\xi_i \overset{iid}{\sim} D$), the DANE and DISCO [28] algorithms require $O((1 + \frac{\kappa^2}{n}) \log(1/\epsilon))$ and
$O((1 + \frac{\sqrt{\kappa}}{n^{.25}}) \log(1/\epsilon))$ rounds of communication, respectively. Hence, DSVRG uses fewer rounds of
communication than DANE, and fewer than DISCO when $\kappa \le n^{1.5}$. DASVRG always uses fewer rounds of communication than
DISCO and DANE. Note that, for these four algorithms, the rounds can be very small in the “big data”
case of large $n$. Indeed, as $n$ increases, all four methods require only $O(\log\frac{1}{\epsilon})$ rounds, which is independent
of the condition number $\kappa$.

⁵ This is the accelerated gradient method by Nesterov, except that the data points are distributed over $m$ machines to
parallelize the computation of $\nabla f(x)$.

⁶ CoCoA+ has a better theoretical performance than CoCoA. CoCoA+ is equivalent to DisDCA with
“practical updates” under certain choices of parameters.

Algorithm | Rounds | Runtime | Assumptions
DSVRG | $(1 + \sqrt{m/n}) \log\frac{1}{\epsilon}$ | $G(n + \sqrt{mn}) \log\frac{1}{\epsilon}$ | Assumption 1
DASVRG | $(1 + (\frac{m}{n})^{.25}) \log(1 + \sqrt{m/n}) \log\frac{1}{\epsilon}$ | $G(n + n^{.75} m^{.25}) \log(1 + \sqrt{m/n}) \log\frac{1}{\epsilon}$ | Assumption 1
DISCO (quad) | $m^{.25} \log\frac{1}{\epsilon}$ | $Q m^{.25} \log\frac{1}{\epsilon}$ | $\xi_i \overset{iid}{\sim} D$

Table 2 Rounds and runtime of DSVRG, DASVRG and DISCO when $\kappa = \Theta(\sqrt{N})$. The coefficients $Q$ and $G$ are defined
as in Table 1.

Moreover, DISCO and DANE have large running times due to solving a linear system in each round,
which is not practical for problems of large dimensionality. As an alternative, Zhang and Xiao [28]
suggest solving the linear system inexactly using another optimization algorithm, but
this still has a large runtime for ill-conditioned problems. The runtimes of DISCO and DANE are shown
in Table 1, with $Q = O(d^3)$ representing the time for taking a matrix inverse and $Q = O(dn\sqrt{\kappa} \log\frac{1}{\epsilon})$
representing the time when an accelerated first-order method is used for solving the linear system.
In both cases, the runtimes of DISCO and DANE can be higher than those of DSVRG or DASVRG
when $d$ is large. Furthermore, DANE only has the theoretical guarantee mentioned above when it is
applied to quadratic problems, for example, regularized linear regression. Also, DISCO only applies to
self-concordant functions with easily computed Hessians⁷, and makes strong statistical assumptions on
the data points. On the contrary, DSVRG and DASVRG work for the more general problem (1) and do
not assume $\xi_i \overset{iid}{\sim} D$ for (3).

We also make the connection to the recent lower bounds [2] for the rounds of communication needed by
distributed optimization. Arjevani and Shamir [2] prove that, for a class of $\delta$-related functions (see [2] for
the definition) and for a class of algorithms, the rounds of communication achieved by DISCO are optimal.
However, as mentioned above, DASVRG needs fewer rounds than DISCO. This is not a contradiction,
since DASVRG does not fall into the class of algorithms subject to the lower bound in [2]. In particular,
the algorithms considered in [2] can only use the $n$ local data points from the initial partition to update
the local solutions, while DASVRG samples and utilizes a second set of data points in each machine in
addition to those $n$ data points.

Building on the work of [2], we prove a new lower bound showing that any distributed first-order
algorithm requires $\tilde{\Omega}(\sqrt{\kappa/n} \log\frac{1}{\epsilon})$ rounds. This lower bound, combined with the convergence analysis of
DASVRG, shows that DASVRG is optimal in the number of rounds of communication.

In Table 2, we compare the rounds and runtime of DSVRG, DASVRG and DISCO in the case where
$\kappa = \Theta(\sqrt{N}) = \Theta(\sqrt{mn})$, which is a typical setting for ERM problems as justified in [28, 25, 24, 23]. We
only compare our methods against DISCO, since it uses the fewest rounds of communication among
the other related algorithms. Let us consider the case where $n > m$, which is true in almost any reasonable
distributed computing scenario. We can see from Table 2 that DSVRG needs fewer rounds
than DISCO. In fact, both DSVRG and DASVRG use $O(\log\frac{1}{\epsilon})$ rounds of communication, which
is almost a constant for many practical machine learning applications⁸.


3 Distributed SVRG 

In this section, we consider a distributed stochastic variance reduced gradient (DSVRG) method that is
based on a parallelization of SVRG. SVRG works in multiple stages and, in each stage, one
batch gradient is computed using all $N$ data points and $O(\kappa)$ iterative updates are performed with only
one data point processed in each. Our distributed algorithm randomly partitions the $N$ data points onto
$m$ machines, with $n$ local data points on each, to parallelize the computation of the batch gradient in
SVRG. Then, we let the $m$ machines conduct the iterative updates of SVRG serially in a “round-robin”
scheme: all machines stay idle except one machine, which performs a certain number of iterative
updates of SVRG using its local data and passes the solution to the next machine. However, the only
caveat in this idea is that the iterative update of SVRG requires an unbiased estimator of $\nabla f(x)$, which
can be constructed by sampling over the whole data set. The unbiasedness will be lost if each
machine can only sample over its local data. To address this issue, we use the remaining $\tilde{n} = C - n$
memory space of each machine to store a second set of data which is uniformly sampled from the whole
data set before the algorithm starts. If each machine samples over this dataset, the unbiased estimator
will still be available, so that the convergence property can be inherited from the single-machine SVRG.

⁷ The examples in [28] all take the form of $f_i(x) = g(a_i^T x)$ for some function $g$ on $\mathbb{R}$, which is more specific than
(1). Under this form, it is relatively easy to compute the Hessian of $f_i$.

⁸ Typically $\epsilon \in$ (10“®, 10“^), so $\log\frac{1}{\epsilon}$ is always less than 20.

To load the second dataset mentioned above onto each machine, we design an efficient data allocation
scheme which reuses the randomness of the first partitioned dataset to construct the second one. We
show that this method increases the overlap between the first and second datasets, so that it
requires a smaller amount of communication than the direct implementation.


3.1 An efficient data allocation procedure 

To facilitate the presentation, we define a multi-set as a collection of items where some items can be
repeated. We allow taking the union of a regular set $S$ and a multi-set $R$, which is defined as a regular
set $S \cup R$ consisting of the items in either $S$ or $R$ without repetition.

We assume that a random partition $S_1, \dots, S_m$ of $[N]$ can be constructed efficiently. A straightforward
data allocation procedure is to prepare a partition $S_1, \dots, S_m$ of $[N]$ and then sample a sequence of $Q$
i.i.d. indices $r_1, \dots, r_Q$ uniformly with replacement from $[N]$. After partitioning $r_1, \dots, r_Q$ into $m$
multi-sets $R_1, \dots, R_m \subset [N]$, we allocate data $\{f_i \mid i \in S_j \cup R_j\}$ to machine $j$. Since $S_1, \dots, S_m$ have occupied
$n$ of the memory in each machine, $Q$ can be at most $\tilde{n}m$. Note that the amount of communication in
distributing $S_1, \dots, S_m$ is exactly $N$, which is necessary for almost all distributed algorithms. However,
this straightforward procedure requires an extra $O(Q)$ amount of communication for distributing $R_j \setminus S_j$.

To improve the efficiency of data allocation, we propose a procedure which reuses the randomness
of $S_1, \dots, S_m$ to generate the indices $r_1, \dots, r_Q$ so that the overlap between $S_j$ and $R_j$ is
increased, which helps reduce the additional amount of communication for distributing $R_1, \dots, R_m$. The
key observation is that the concatenation of $S_1, \dots, S_m$ is a random permutation of $[N]$, which already
provides enough of the randomness needed by $R_1, \dots, R_m$. Hence, it is easy to build the i.i.d. samples
$r_1, \dots, r_Q$ by adding a little additional randomness on top of $S_1, \dots, S_m$. With this observation in mind,
we propose our data allocation procedure in Algorithm 1.


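Algorithm 1 Data Allocation: DA($N$, $m$, $Q$)

Input: Index set $[N]$, the number of machines $m$, and the length of the target sequence $Q$.
Output: A random partition $S_1, \dots, S_m$ of $[N]$, indices $r_1, \dots, r_Q \in [N]$, multi-sets $R_1, \dots, R_m \subset [N]$, and data
$\{f_i \mid i \in S_j \cup R_j\}$ stored on machine $j$ for each $j \in [m]$.

Center samples $r_1, \dots, r_Q$ and $R_1, \dots, R_m$ as follows:
1: Randomly partition $[N]$ into $m$ disjoint sets $S_1, \dots, S_m$ of the same size $n = \frac{N}{m}$.
2: Concatenate the subsets $S_1, \dots, S_m$ into a random permutation $i_1, \dots, i_N$ of $[N]$ so that $S_j = \{i_{(j-1)n+1}, \dots, i_{jn}\}$.
3: for $\ell = 1$ to $Q$ do
4:    Let $r_\ell = i_\ell$ with probability $1 - \frac{\ell - 1}{N}$, and $r_\ell = i_{\ell'}$ with probability $\frac{1}{N}$ for each $\ell' = 1, 2, \dots, \ell - 1$.
5: end for
6: Let

$$R_j = \begin{cases} \{r_{(j-1)\tilde{n}+1}, \dots, r_{j\tilde{n}}\} & \text{if } j = 1, \dots, \lceil Q/\tilde{n} \rceil - 1 \\ \{r_{(\lceil Q/\tilde{n} \rceil - 1)\tilde{n}+1}, \dots, r_Q\} & \text{if } j = \lceil Q/\tilde{n} \rceil \\ \emptyset & \text{if } j = \lceil Q/\tilde{n} \rceil + 1, \dots, m. \end{cases} \qquad (4)$$

Distribute data points to machines:
7: Machine $j$ acquires the data points in $\{f_i \mid i \in S_j \cup R_j\}$ from the storage center.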
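The sampling steps of the data allocation procedure can be sketched in Python as follows. This is a minimal single-process illustration rather than the authors' implementation: indices are 0-based, the function names and the explicit `n_tilde` argument (the spare capacity $\tilde{n}$ per machine) are hypothetical, and no data is actually transmitted. A single uniform draw `u` decides between keeping the permutation entry and copying an earlier one, which yields the probabilities $1 - \frac{\ell-1}{N}$ and $\frac{1}{N}$ stated in the algorithm.

```python
import random


def data_allocation(N, m, Q, n_tilde, rng):
    """Return (S, r, R): the partition, the sampled indices, and the multi-sets."""
    assert N % m == 0, "the sketch assumes N = m * n"
    assert Q <= N and Q <= n_tilde * m, "Q must fit the spare memory"
    n = N // m
    perm = list(range(N))
    rng.shuffle(perm)                      # i_1, ..., i_N (0-based here)
    S = [perm[j * n:(j + 1) * n] for j in range(m)]
    r = []
    for l in range(Q):                     # l equals ell - 1
        u = rng.randrange(N)               # uniform on {0, ..., N-1}
        # u < l happens with probability l/N and selects one specific earlier
        # entry with probability 1/N; otherwise keep the current entry perm[l].
        r.append(perm[u] if u < l else perm[l])
    # Consecutive blocks of length n_tilde; trailing machines get empty lists.
    R = [r[j * n_tilde:(j + 1) * n_tilde] for j in range(m)]
    return S, r, R


def extra_points(S, R):
    """Number of distinct data points machine j needs beyond S_j, summed over j."""
    return sum(len(set(Rj) - set(Sj)) for Sj, Rj in zip(S, R))
```

For example, `data_allocation(12, 3, 5, 2, random.Random(0))` returns three disjoint sets of four indices each, five sampled indices, and multi-sets of sizes 2, 2 and 1.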


The correctness and the expected amount of communication of Algorithm 1 are characterized as
follows.

Lemma 1 The sequence $r_1, \dots, r_Q$ generated in Algorithm 1 has the same joint distribution as a
sequence of i.i.d. indices uniformly sampled with replacement from $[N]$. Moreover, the expected amount of
communication for distributing $\cup_{j=1}^{m} \{f_i \mid i \in R_j \setminus S_j\}$ is at most $\frac{Q^2}{N}$.

Proof Conditioned on $i_1, \dots, i_{\ell-1}$ and $r_1, \dots, r_{\ell-1}$, the random index $i_\ell$ has a uniform distribution over
$[N] \setminus \{i_1, \dots, i_{\ell-1}\}$. Therefore, by Line 4 in Algorithm 1, the conditional distribution of the random index
$r_\ell$, conditioned on $i_1, \dots, i_{\ell-1}$ and $r_1, \dots, r_{\ell-1}$, is a uniform distribution over $[N]$. Hence, we complete
the proof of the first claim of the lemma.

To analyze the amount of communication, we note that $i_\ell \ne r_\ell$ with probability $\frac{\ell-1}{N}$. Suppose $i_\ell \in S_j$
for some $j$. We know that the data point $f_{r_\ell}$ needs to be transmitted to machine $j$ separately from $S_j$
only if $i_\ell \ne r_\ell$. Therefore, the expected amount of communication for distributing $\cup_{j=1}^{m} \{f_i \mid i \in R_j \setminus S_j\}$ is
upper bounded by $\sum_{\ell=1}^{Q} \frac{\ell-1}{N} \le \frac{Q^2}{N}$. □

According to Lemma 1, besides the (necessary) $N$ amount of communication to distribute $S_1, \dots, S_m$,
Algorithm 1 needs only $\frac{Q^2}{N}$ additional amount of communication to distribute $R_1, \dots, R_m$, thanks to the
overlaps between $S_j$ and $R_j$ for each $j$. This additional amount is less than the $O(Q)$ amount required
by the straightforward method when $Q \le N$.

Note that, in the DSVRG algorithm we will introduce later, we need $Q = O(\kappa \log(1/\epsilon))$. When
$\kappa = \Theta(\sqrt{N})$ in a typical ERM problem, we have $\frac{Q^2}{N} = \frac{\Theta(N \log^2(1/\epsilon))}{N} = \Theta((\log(1/\epsilon))^2)$, which is typically
much less than the $N$ amount of communication in distributing $S_1, \dots, S_m$. In other words, although
DSVRG does require an additional amount of communication to allocate the data compared with other algorithms,
this additional amount is nearly negligible.
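To make the size of this overhead concrete, the short Python sketch below evaluates the sum $\sum_{\ell=1}^{Q} \frac{\ell-1}{N} = \frac{Q(Q-1)}{2N}$ from the proof of Lemma 1 and compares it with the $\frac{Q^2}{N}$ bound and with the $Q$ transfers of the straightforward scheme. The function name and the numbers used are illustrative assumptions only.

```python
def expected_extra_communication(Q, N):
    """Sum over ell = 1..Q of (ell - 1)/N, i.e. Q(Q-1)/(2N)."""
    return Q * (Q - 1) / (2 * N)


N, Q = 1_000_000, 20_000
overhead = expected_extra_communication(Q, N)   # about 200 extra data points
bound = Q * Q / N                               # Lemma 1 bound: 400
naive = Q                                       # straightforward scheme: 20000
print(overhead, bound, naive)
```

With these illustrative values the expected overhead is roughly one percent of what the straightforward scheme would transmit.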


3.2 DSVRG algorithm and its theoretical guarantees 

With the data $\{f_i \mid i \in S_j \cup R_j\}$ stored on machine $j$ for each $j \in [m]$ after running Algorithm 1, we are
ready to present the distributed SVRG algorithm in Algorithm 2 and Algorithm 3.

We start SVRG in machine $k$ with $k = 1$ initially at an initial solution $\tilde{x}^0 \in \mathbb{R}^d$. At the beginning
of stage $\ell$ of SVRG, all $m$ machines participate in computing a batch gradient $h^\ell$ in parallel using the
data indexed by $S_1, \dots, S_m$. Within stage $\ell$, in each iteration, machine $k$ samples one data point $f_i$ from its
local data indexed by $R_k$ to construct a stochastic gradient $\nabla f_i(x_t) - \nabla f_i(\tilde{x}^\ell) + h^\ell$ and performs the
iterative update. Since $R_k$ is a multi-set that consists of indices sampled with replacement from $[N]$, the
unbiasedness of $\nabla f_i(x_t) - \nabla f_i(\tilde{x}^\ell) + h^\ell$, i.e., the property

$$E_{i \sim R_k}[\nabla f_i(x_t) - \nabla f_i(\tilde{x}^\ell) + h^\ell] = E_{i \sim [N]}[\nabla f_i(x_t) - \nabla f_i(\tilde{x}^\ell) + h^\ell] = \nabla f(x_t) \qquad (5)$$

is guaranteed. After this iteration, $i$ is removed from $R_k$.

The $m$ machines do the iterative updates in order from machine 1 to machine $m$. Once the current
active machine, say machine $k$, has removed all of its samples in $R_k$ (so that $R_k = \emptyset$), it must pass
the current solution and the running average of all solutions generated in the current stage to machine
$k + 1$. At any time during the algorithm, there is only one machine updating the solution $x_t$, and the
other $m - 1$ machines only contribute to computing the batch gradient $h^\ell$. We want to emphasize that
machines should never use any sample in the $R_j$'s more than once since, otherwise, the
stochastic gradient $\nabla f_i(x_t) - \nabla f_i(\tilde{x}^\ell) + h^\ell$ will lose its unbiasedness. We describe formally each stage of
this algorithm in Algorithm 2 and the iterative update in Algorithm 3.

Note that an implicit requirement of Algorithm 2 is $TK = Q \le \tilde{n}m$. Because one element in $R_k$ is
removed in each iterative update (Line 3) of Algorithm 3, this update cannot be performed any longer
once each $R_k$ becomes empty. Hence, the condition $TK = Q$ ensures that the number of iterative updates
matches the total number of indices contained in $R_1, R_2, \dots, R_m$. Note that, in the worst case, $S_j$ and
$R_j$ may not overlap, so that the size of the data $\{f_i \mid i \in S_j \cup R_j\}$ stored on machine $j$ can be $|S_j| + |R_j|$.
That is why the size of $R_j$ is only $\tilde{n} = C - n$. The condition $TK = Q \le \tilde{n}m$ is needed here so that the
total amount of data stored in all machines does not exceed the total capacity $Cm$. The convergence of
Algorithm 2 is established by the following theorem.

Theorem 1 Suppose $0 < \eta < \frac{1}{4L}$ and $TK = Q \le \tilde{n}m$. Algorithm 2 guarantees

$$E[f(\tilde{x}^K) - f(x^*)] \le \left( \frac{1}{\mu\eta(1 - 4L\eta)T} + \frac{4L\eta(T + 1)}{(1 - 4L\eta)T} \right)^K [f(\tilde{x}^0) - f(x^*)]. \qquad (6)$$

In particular, when $\eta = \frac{1}{16L}$, $T = 96\kappa$ and $TK = Q \le \tilde{n}m$, Algorithm 2 needs
$K = \frac{1}{\log(9/8)} \log\left( \frac{f(\tilde{x}^0) - f(x^*)}{\epsilon} \right)$ stages to find an $\epsilon$-optimal solution.

Proof In the iterative update given in Line 3 of Algorithm 3, a stochastic gradient $\nabla f_i(x_t) - \nabla f_i(\tilde{x}) + h$ is
constructed with $h$ being the batch gradient $\nabla f(\tilde{x})$ and $i$ sampled from $R_k$ in the active machine $k$. Since
$i$ is one of the indices $r_1, \dots, r_Q$, each of which is sampled uniformly from $[N]$, this stochastic gradient is
an unbiased estimator of $\nabla f(x_t)$. Therefore, the path of solutions $\tilde{x}^0, \tilde{x}^1, \tilde{x}^2, \dots$ generated by Algorithm 2
has the same distribution as the ones generated by single-machine SVRG, so that the convergence result
for single-machine SVRG can be directly applied to Algorithm 2. The inequality (6) has been shown
for single-machine SVRG, so it now also holds for Algorithm 2.
When $\eta = \frac{1}{16L}$ and $T = 96\kappa$, it is easy to show that

$$\frac{1}{\mu\eta(1 - 4L\eta)T} + \frac{4L\eta(T + 1)}{(1 - 4L\eta)T} \le \frac{1}{\mu\eta(1 - 4L\eta)T} + \frac{8L\eta}{1 - 4L\eta} = \frac{2}{9} + \frac{2}{3} = \frac{8}{9} \qquad (7)$$

so that Algorithm 2 needs $K = \frac{1}{\log(9/8)} \log\left( \frac{f(\tilde{x}^0) - f(x^*)}{\epsilon} \right)$ stages to find an $\epsilon$-optimal solution. □

Algorithm 2 Distributed SVRG (DSVRG)

Input: An initial solution $\tilde{x}^0 \in \mathbb{R}^d$, data $\{f_i\}_{i \in [N]}$, the number of machines $m$, a step length $\eta < \frac{1}{4L}$, the number of
iterations $T$ in each stage, the number of stages $K$, and a sample size $Q = TK$.
Output: $\tilde{x}^K$
1: Use DA to generate (a) a random partition $S_1, \dots, S_m$ of $[N]$, (b) $Q$ i.i.d. indices $r_1, \dots, r_Q$ uniformly sampled with
replacement from $[N]$, (c) $m$ multi-sets $\mathcal{R} = \{R_1, \dots, R_m\}$ defined as in (4), and (d) data $\{f_i \mid i \in S_j \cup R_j\}$ stored on
machine $j$.
2: $k \leftarrow 1$
3: for $\ell = 0, 1, 2, \dots, K - 1$ do
4:    Center sends $\tilde{x}^\ell$ to each machine
5:    for machine $j = 1, 2, \dots, m$ in parallel do
6:       Compute $h_j^\ell = \sum_{i \in S_j} \nabla f_i(\tilde{x}^\ell)$ and send it to center
7:    end for
8:    Center computes $h^\ell = \frac{1}{N} \sum_{j=1}^{m} h_j^\ell$ and sends it to machine $k$
9:    $(\tilde{x}^{\ell+1}, \mathcal{R}, k) \leftarrow$ SS-SVRG$(\tilde{x}^\ell, h^\ell, \mathcal{R}, k, \eta, T)$
10: end for

Algorithm 3 Single-Stage SVRG: SS-SVRG($\tilde{x}$, $h$, $\mathcal{R}$, $k$, $\eta$, $T$)

Input: A solution $\tilde{x} \in \mathbb{R}^d$, data $\{f_i\}_{i \in [N]}$, a batch gradient $h$, $m$ multi-sets $\mathcal{R} = \{R_1, R_2, \dots, R_m\}$, the index of the
active machine $k$, a step length $\eta < \frac{1}{4L}$, and the number of iterations $T$.
Output: The average solution $\bar{x}_T$, the updated multi-sets $\mathcal{R} = \{R_1, R_2, \dots, R_m\}$, and the updated index of the active machine
$k$.
1: $x_0 = \tilde{x}$ and $\bar{x}_0 = 0$
2: for $t = 0, 1, 2, \dots, T - 1$ do
3:    Machine $k$ samples an instance $i$ from $R_k$ and computes
      $x_{t+1} = x_t - \eta(\nabla f_i(x_t) - \nabla f_i(\tilde{x}) + h)$, $\bar{x}_{t+1} = \frac{x_{t+1} + t\bar{x}_t}{t + 1}$, $R_k \leftarrow R_k \setminus \{i\}$
4:    if $R_k = \emptyset$ then
5:       $x_{t+1}$ and $\bar{x}_{t+1}$ are sent to machine $k + 1$
6:       $k \leftarrow k + 1$
7:    end if
8: end for
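The stage structure and the round-robin updates of DSVRG can be simulated in a single process with the following Python sketch. It is an illustration under several assumptions rather than the authors' implementation: the objective is a toy ridge regression with $f_i(x) = \frac{1}{2}(a_i^T x - b_i)^2 + \frac{\lambda}{2}\|x\|^2$, the second sample set is drawn i.i.d. directly (the straightforward scheme, not the data allocation procedure of Algorithm 1), the hand-off to the next machine happens lazily just before the next update, and all function names are hypothetical.

```python
import random


def grad_fi(x, a, b, lam):
    """Gradient of f_i(x) = 0.5*(a.x - b)^2 + 0.5*lam*||x||^2."""
    resid = sum(aj * xj for aj, xj in zip(a, x)) - b
    return [resid * aj + lam * xj for aj, xj in zip(a, x)]


def objective(x, data, lam):
    """Average of the f_i over all data points."""
    fit = sum(0.5 * (sum(aj * xj for aj, xj in zip(a, x)) - b) ** 2
              for a, b in data) / len(data)
    return fit + 0.5 * lam * sum(xj * xj for xj in x)


def dsvrg(data, lam, m, n_tilde, eta, T, K, rng):
    """Simulate DSVRG; return the final solution and the rounds used."""
    N, d = len(data), len(data[0][0])
    n, Q = N // m, T * K
    assert Q <= n_tilde * m, "TK = Q must fit the spare memory"
    perm = list(range(N))
    rng.shuffle(perm)
    S = [perm[j * n:(j + 1) * n] for j in range(m)]       # local partitions
    r = [rng.randrange(N) for _ in range(Q)]              # i.i.d. indices
    R = [r[j * n_tilde:(j + 1) * n_tilde] for j in range(m)]
    x_tilde, k, rounds = [0.0] * d, 0, 0
    for _ in range(K):
        # Batch gradient: each machine sums over S_j, the center averages.
        h = [0.0] * d
        for j in range(m):
            for i in S[j]:
                g = grad_fi(x_tilde, *data[i], lam)
                h = [hj + gj / N for hj, gj in zip(h, g)]
        rounds += 1
        x, x_bar = list(x_tilde), [0.0] * d
        for t in range(T):
            while not R[k]:            # active machine exhausted: hand off
                k += 1
                rounds += 1
            i = R[k].pop()             # each sample is used exactly once
            g_new = grad_fi(x, *data[i], lam)
            g_old = grad_fi(x_tilde, *data[i], lam)
            x = [xj - eta * (gn - go + hj)
                 for xj, gn, go, hj in zip(x, g_new, g_old, h)]
            x_bar = [(xj + t * bj) / (t + 1) for xj, bj in zip(x, x_bar)]
        x_tilde = x_bar
    return x_tilde, rounds
```

Because every index in the multi-sets is popped exactly once, the sketch respects the requirement that no sample in the $R_j$'s is reused, which is what keeps the stochastic gradient unbiased.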

By Theorem 1, DSVRG can find an $\epsilon$-optimal solution for (1) after $K = O(\log(1/\epsilon))$ stages with
$T = O(\kappa)$ iterative updates (Line 3 of Algorithm 3) in each stage. Therefore, there are $O(\kappa \log(1/\epsilon))$
iterative updates in total, so that $Q$ must be $TK = O(\kappa \log(1/\epsilon))$. Since the available memory space
requires $Q \le \tilde{n}m = (C - n)m$, we will need at least $C = \Omega(n + \frac{\kappa}{m} \log(1/\epsilon))$ in order to implement
DSVRG.

Under the assumptions of Theorem 1, including Assumption 1 (so $\tilde{n} = \Omega(n)$), we discuss the
theoretical performance of DSVRG as follows.

- Local parallel runtime: Since one gradient is computed in each iterative update and $n$ gradients
are computed in parallel to construct the batch gradient, the total local parallel runtime for DSVRG
to find an $\epsilon$-optimal solution is $O((n + T)K) = O((n + \kappa) \log(1/\epsilon))$.

- Communication: There is a fixed amount of $O(N)$ communication needed to distribute the
partitioned data $S_1, \dots, S_m$ to the $m$ machines. When Algorithm 1 is used to generate $R_1, \dots, R_m$, the
additional amount of communication to complete the data allocation step of DSVRG is
$O(Q^2/N) = O(\kappa^2 \log^2(1/\epsilon)/N)$ in expectation according to Lemma 1. During the algorithm, the batch gradient
computations and the iterative updates together require $O((m + \frac{T}{\tilde{n}})K) = O((m + \frac{\kappa}{n}) \log(1/\epsilon))$ amount
of communication.

- Rounds of communication: Since DSVRG needs one round of communication to compute a batch
gradient at each stage (one call of SS-SVRG) and one round after every $\tilde{n}$ iterative updates (when
$R_k = \emptyset$), it needs $O(K + \frac{TK}{\tilde{n}}) = O((1 + \frac{\kappa}{n}) \log(1/\epsilon))$ rounds of communication in total to find an
$\epsilon$-optimal solution. We note that, if the memory space $C$ in each machine is large enough, the value
of $\tilde{n}$ can be larger than $\kappa$, so that the rounds of communication needed will be only $O(\log(1/\epsilon))$.
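The three metrics above can be tabulated for concrete problem sizes with a small Python helper. This is a rough sketch under stated assumptions: it uses the parameter choices $\eta = \frac{1}{16L}$ and $T = 96\kappa$ from Theorem 1, normalizes $L = 1$ and the initial optimality gap to 1 by default, and counts vectors per stage as $2m + 1$ plus two per hand-off, which is one plausible accounting rather than the paper's exact constant. The function name and the example sizes are hypothetical.

```python
import math


def dsvrg_plan(n, m, n_tilde, kappa, eps, gap0=1.0, L=1.0):
    """Stage count, sample budget and cost estimates for DSVRG."""
    mu = L / kappa
    eta = 1.0 / (16.0 * L)
    T = math.ceil(96 * kappa)
    # Per-stage contraction factor appearing in the convergence bound.
    rate = (1.0 / (mu * eta * (1 - 4 * L * eta) * T)
            + 4 * L * eta * (T + 1) / ((1 - 4 * L * eta) * T))
    K = math.ceil(math.log(gap0 / eps) / math.log(9 / 8))
    Q = T * K
    handoffs = math.ceil(Q / n_tilde)
    return {
        "rate": rate,
        "T": T,
        "K": K,
        "Q": Q,
        "fits_memory": Q <= n_tilde * m,
        "parallel_gradients": (n + T) * K,
        "rounds": K + handoffs,
        "vectors_sent": (2 * m + 1) * K + 2 * handoffs,
    }


plan = dsvrg_plan(n=10**5, m=200, n_tilde=10**5, kappa=1000, eps=1e-6)
print(plan["K"], plan["rounds"], plan["rate"])
```

In this example the contraction factor is well below $\frac{8}{9}$, which reflects that the bound $\frac{T+1}{T} \le 2$ used in the analysis is loose when $T$ is large.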


3.3 Regimes where DSVRG is Optimal 

In this subsection, we consider a scenario where k = with a constant 0 < (5 < ^ and e = ;^ with 

positive constant s. We show that with a different choices of rj, T and K, DSVRG can find an e-optimal 
solution solution for o with the optimal parallel runtime, the optimal amount of communication, and 
the optimal number of rounds of communication simultaneously. 

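Proposition 1 Suppose $\kappa \le n^{1-2\delta}/32$ with a constant $0 < \delta < 1/2$ and we choose $\eta = \frac{1}{16 n^{\delta} L}$, $T = n$ and $K = \frac{\log(1/\epsilon)}{\log(n^{\delta}/2)}$ in Algorithm 3. Also, suppose Assumption 1 holds and $TK = Q \le \tilde n m$. Algorithm 3 finds an $\epsilon$-optimal solution for (1) with $O\left(\frac{\log(1/\epsilon)}{\delta \log n}\right)$ rounds of communication, $O\left(\frac{n\log(1/\epsilon)}{\delta \log n}\right)$ total parallel runtime, and $O\left(\frac{m\log(1/\epsilon)}{\delta \log n}\right)$ amount of communication.

In particular, when $\epsilon = 1/n^{s}$ with a positive constant $s$, Algorithm 3 finds an $\epsilon$-optimal solution for (1) with $O(1)$ rounds of communication, $O(n)$ total parallel runtime, and $O(m)$ amount of communication.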

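Proof Since $\eta = \frac{1}{16 n^{\delta} L} < \frac{1}{4L}$, Algorithm 3 guarantees (7) according to Theorem 1. With $T = n$ and $\kappa \le n^{1-2\delta}/32$, we have

\[
\frac{1}{\mu\eta(1-4L\eta)T} + \frac{4L\eta(T+1)}{(1-4L\eta)T}
\le \frac{1}{\mu\eta(1-4L\eta)T} + \frac{8L\eta}{1-4L\eta}
= \frac{16 n^{\delta}\kappa}{(1-1/(4n^{\delta}))\,n} + \frac{1}{2n^{\delta}(1-1/(4n^{\delta}))}
\]
\[
\le \frac{1}{2n^{\delta}(1-1/(4n^{\delta}))} + \frac{1}{2n^{\delta}(1-1/(4n^{\delta}))}
\le \frac{2}{n^{\delta}}.
\]

Hence, $E[f(x^{K}) - f(x^*)] \le \epsilon$ can be implied by the inequality (7) as $K = \frac{\log(1/\epsilon)}{\log(n^{\delta}/2)}$.

Under Assumption 1, we have $\tilde n = \Omega(n)$ so that the number of rounds of communication Algorithm 3 needs is $O(K + TK/\tilde n) = O(K) = O\left(\frac{\log(1/\epsilon)}{\delta\log n}\right)$. Moreover, the total parallel runtime needed is $O((n + T)K) = O\left(\frac{n\log(1/\epsilon)}{\delta\log n}\right)$ and the amount of communication needed is $O((m + T/\tilde n)K) = O\left(\frac{m\log(1/\epsilon)}{\delta\log n}\right)$. The second conclusion can be easily derived by replacing $\epsilon$ with $1/n^{s}$. □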

The justification for the scenario where $\kappa = \Theta(n^{1-2\delta})$ and $\epsilon = 1/n^{s}$, and why these performance guarantees are optimal, have been discussed in Section 1.1.
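As a numerical sanity check of the per-stage contraction used in this regime, the sketch below evaluates the factor $\frac{1}{\mu\eta(1-4L\eta)T} + \frac{4L\eta(T+1)}{(1-4L\eta)T}$ and compares it with $2/n^{\delta}$. The parameter choices $\eta = 1/(16 n^{\delta} L)$, $T = n$ and $\kappa = n^{1-2\delta}/32$ are assumptions read off from the proof, and the function names are hypothetical.

```python
def svrg_contraction(L, mu, eta, T):
    """Per-stage contraction factor of SVRG for step length eta and T inner updates."""
    assert 0 < eta < 1.0 / (4 * L), "the bound needs eta < 1/(4L)"
    denom = (1 - 4 * L * eta) * T
    return 1.0 / (mu * eta * denom) + 4 * L * eta * (T + 1) / denom


def prop1_factor(n, delta, L=1.0):
    """Return (contraction factor, 2 / n**delta) under the assumed parameter choices."""
    kappa = n ** (1 - 2 * delta) / 32   # assumed largest admissible condition number
    mu = L / kappa
    eta = 1.0 / (16 * n ** delta * L)   # assumed step length
    return svrg_contraction(L, mu, eta, T=n), 2.0 / n ** delta


if __name__ == "__main__":
    for n, delta in [(10**4, 0.25), (10**6, 0.1), (100, 0.4)]:
        factor, bound = prop1_factor(n, delta)
        print(n, delta, factor, bound)
```

When $n^{\delta} > 2$ the factor is below 1, so roughly $\log(1/\epsilon)/\log(n^{\delta}/2)$ stages suffice.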
4 Accelerated Distributed SVRG


In this section, we use the generic acceleration techniques in [8] and [14] to further improve the theoretical performance of DSVRG and obtain a distributed accelerated stochastic variance reduced gradient (DASVRG) method.


4.1 DASVRG algorithm and its theoretical guarantees 
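Following [8] and [14], we define a proximal function for $f(x)$ as

\[
f_\sigma(x; y) = f(x) + \frac{\sigma}{2}\|x - y\|^2 = \frac{1}{N}\sum_{i=1}^{N} f_i(x; y), \tag{8}
\]

where $f_i(x; y) = f_i(x) + \frac{\sigma}{2}\|x - y\|^2$, $\sigma \ge 0$ is a constant to be determined later and $y \in \mathbb{R}^d$ is a proximal point. The condition number of this proximal function is $\kappa(f_\sigma) = \frac{L + \sigma}{\mu + \sigma}$, which can be smaller than $\kappa$ when $\sigma$ is large enough.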

Given an algorithm, denoted by $\mathcal{A}$, that can be applied to (1), the acceleration scheme developed in [8] and [14] is an iterative method that involves inner and outer loops and uses $\mathcal{A}$ as a sub-routine in its outer loops. In particular, in the $p$-th outer iteration of this acceleration scheme, the algorithm $\mathcal{A}$ is applied to find a solution for the $p$-th proximal problem defined on a proximal point $y_{p-1}$, namely,

\[
f_p^* = \min_{x \in \mathbb{R}^d} f_\sigma(x; y_{p-1}) \quad \text{for } p = 1, 2, \dots, P. \tag{9}
\]

The algorithm $\mathcal{A}$ does not need to solve (9) to optimality but only needs to generate an approximate solution $\hat x_p$ with an accuracy $\epsilon_p$ in the sense that

\[
f_\sigma(\hat x_p; y_{p-1}) - f_p^* \le \epsilon_p. \tag{10}
\]

When $\kappa(f_\sigma)$ is smaller than $\kappa$, finding such an $\hat x_p$ is easier than finding an $\epsilon$-optimal solution for (1).

Then, the acceleration scheme uses $\hat x_p$ to construct a new proximal point $y_p$ using an extrapolation update as $y_p = \hat x_p + \beta_p(\hat x_p - \hat x_{p-1})$, where $\beta_p \ge 0$ is an extrapolation step length. After that, the $(p+1)$-th proximal problem is constructed based on $y_p$, which will be solved in the next outer iteration. With an
appropriately chosen value for ct, it is shown by [ 5 ] and m that, for many existing A including SAG m 
[T^ . SAGA [B], SDGA [U], SVRG [TT] and Finito/MISO [711TB]. this acceleration scheme needs a smaller 
runtime for finding an e-optimal solution than applying algorithm A directly to (H)) . 

Given the success of this acceleration scheme in the single-machine setting, it is promising to also apply this scheme to DSVRG to further improve its theoretical performance. Indeed, this can be done by choosing $\mathcal{A}$ in the aforementioned acceleration scheme to be DSVRG, which yields the DASVRG algorithm. In particular, in the $p$-th outer iteration of the acceleration scheme, we use DSVRG to solve the proximal problem (9) in a distributed way up to an accuracy $\epsilon_p$. We present DASVRG in Algorithm 4, where $f_i(x; y) = f_i(x) + \frac{\sigma}{2}\|x - y\|^2$.

Proposition 2 Suppose rj = T = 96k(/ct), 


K = 


log(9/8) 


log 


10368cr 


^2 y/q 


= O ( log ( 1 -I- - 


and $TKP = Q \le \tilde n m$. The solution $\hat x_p$ generated in Algorithm 4 satisfies (10) with


ep = n[/(®o) - fix*)] 1 




, for p= 1,2,..., P, 


( 11 ) 


( 12 ) 


Moreover, Algorithm 4 finds an $\epsilon$-optimal solution for (1) after $P = O\left(\sqrt{\frac{\mu + \sigma}{\mu}}\log(1/\epsilon)\right)$ outer iterations.
Algorithm 4 Distributed Accelerated SVRG (DASVRG)

Input: An initial solution $\hat x_0 \in \mathbb{R}^d$, data $\{f_i\}_{i=1,\dots,N}$, the number of machines $m$, a step length $\eta < \frac{1}{4(L+\sigma)}$, the number of iterations $T$ in each stage of DSVRG, the number of stages $K$ of DSVRG, the number of outer iterations $P$ in the acceleration scheme, a sample size $TKP = Q \le \tilde n m$, and a parameter $\sigma \ge 0$.

Output: $\hat x_P$

1: Use DA to generate (a) a random partition $S_1, \dots, S_m$ of $[N]$, (b) $Q$ i.i.d. indices $r_1, \dots, r_Q$ uniformly sampled with replacement from $[N]$, (c) $m$ multi-sets $\mathcal{R} = \{R_1, \dots, R_m\}$ defined as in DSVRG, and (d) data $\{f_i \mid i \in S_j \cup R_j\}$ stored on machine $j$.
2: $k \leftarrow 1$
3: Initialize $q = \frac{\mu}{\mu + \sigma}$, $y_0 = \hat x_0$ and $\alpha_0 = \sqrt{q}$
4: for $p = 1, 2, \dots, P$ do
5:   Center computes $x^0 = \hat x_{p-1}$ and sends $y_{p-1}$ to each machine
6:   for $\ell = 0, 1, 2, \dots, K-1$ do
7:     Center sends $x^\ell$ to each machine
8:     for machine $j = 1, 2, \dots, m$ in parallel do
9:       Compute $h_j^\ell = \sum_{i \in S_j} \nabla f_i(x^\ell; y_{p-1})$ and send it to center
10:    end for
11:    Center computes $h^\ell = \frac{1}{N}\sum_{j=1}^{m} h_j^\ell$ and sends it to machine $k$
12:    $(x^{\ell+1}, \mathcal{R}, k) \leftarrow$ SS-SVRG$(x^\ell, h^\ell, \mathcal{R}, k, \eta, T, \{f_i(x; y_{p-1})\}_{i=1,\dots,N})$
13:  end for
14:  Machine $k$ computes $\hat x_p = x^K$ and sends $\hat x_p$ to center
15:  Center computes $\alpha_p \in (0,1)$ from the equation $\alpha_p^2 = (1 - \alpha_p)\alpha_{p-1}^2 + q\alpha_p$.
16:  Center computes $y_p = \hat x_p + \beta_p(\hat x_p - \hat x_{p-1})$, where $\beta_p = \frac{\alpha_{p-1}(1 - \alpha_{p-1})}{\alpha_{p-1}^2 + \alpha_p}$
17: end for

Proof Due to the unbiasedness of $\nabla f_i(x_t) - \nabla f_i(x^\ell) + h^\ell$, conditioning on $\hat x_{p-1}$, the solution path $x^0, x^1, x^2, \dots$ generated within the $p$-th outer loop of Algorithm 4 has the same distribution as the solution path generated by applying single-machine SVRG to (9) with an initial solution of $\hat x_{p-1}$. Hence, all the convergence results of SVRG can be applied.

According to ([7]) in the proof of Theorem [U the choices of r] and T ensure E [fa{xp;yp-i) — /*] < 

(If [/- (ip_i;j/p_i) — /*] . Using this result and following the analysis in Section B.2 in [Tl], we can 
show that Xp satisfies m with Cp given by m if K is set to m- 

Therefore, according to Theorem 3.1 in [14] with p = Algorithm|4|guarantees that E [f{xp) — fix*)] < 

^ 1 ~ 2 ) Ifi^o) — fi^*)] by choosing K as[Tl] This means Algorithm Uj finds an e-optimal solution 

for dH) after P = 0(^ log(l/e)) = O (Y'^^log(l/e)) outer loops. 

The condition $TKP = Q \le \tilde n m$ here is only to guarantee that we have enough samples in $R_1, R_2, \dots, R_m$ to finish a total of $TKP$ iterative updates. □


We will choose $\sigma = L/n$ in DASVRG and obtain the following theorem.

Theorem 2 Suppose $\eta = \frac{1}{16(L+\sigma)}$, $T = 96\kappa(f_\sigma)$, $K$ is chosen as (11), $TKP = Q \le \tilde n m$ and $\sigma = L/n$. Algorithm 4 finds an $\epsilon$-optimal solution for (1) after $O\left((1 + \sqrt{\kappa/n})\log(1 + \kappa/n)\log(1/\epsilon)\right)$ calls of SS-SVRG.

Proof When $\sigma = L/n$, according to Proposition 2, DASVRG needs $P = O\left(\sqrt{\frac{\mu+\sigma}{\mu}}\log(1/\epsilon)\right) = O\left((1 + \sqrt{\kappa/n})\log(1/\epsilon)\right)$ outer iterations with $K = O(\log(1 + \sigma/\mu)) = O(\log(1 + \kappa/n))$ calls of SS-SVRG in each according to (11). Hence, a total of $KP = O\left((1 + \sqrt{\kappa/n})\log(1 + \kappa/n)\log(1/\epsilon)\right)$ calls of SS-SVRG are needed. □

When $\sigma = L/n$, we have $\kappa(f_\sigma) = \frac{L + L/n}{\mu + L/n} = \frac{n\kappa + \kappa}{n + \kappa} \le n + 1$. According to Theorem 2, DASVRG can find an $\epsilon$-optimal solution for (1) after $O\left((1 + \sqrt{\kappa/n})\log(1 + \kappa/n)\log(1/\epsilon)\right)$ calls of SS-SVRG. Note that each call of SS-SVRG involves $T = O(\kappa(f_\sigma)) = O(n)$ iterative updates. Therefore, there are $Q = TKP = O\left((n + \sqrt{n\kappa})\log(1 + \kappa/n)\log(1/\epsilon)\right)$ iterative updates in total. Since the available memory space requires $Q \le \tilde n m = (C - n)m$, we need at least $C = \Omega\left(n + \left(\frac{n}{m} + \frac{\sqrt{n\kappa}}{m}\right)\log(1 + \kappa/n)\log(1/\epsilon)\right)$ in order to implement DASVRG.

Under the assumptions of Theorem 2, including Assumption 1 (so $\tilde n = \Omega(n)$), we summarize the theoretical performance of DASVRG as follows.

- Local parallel runtime: Since each call of SS-SVRG involves a batch gradient computation and $T = O(n)$ iterative updates, the total runtime of DASVRG is $O((n + T)KP) = O\left((n + \sqrt{n\kappa})\log(1 + \kappa/n)\log(1/\epsilon)\right)$.

- The amount of communication: Similar to DSVRG, we need a fixed amount of $O(N)$ communication to distribute $S_1, \dots, S_m$ and an additional amount of $O(Q^2/N)$ communication in expectation, with $Q = TKP$, to distribute $R_1, \dots, R_m$ to $m$ machines. During the algorithm, the batch gradient computations and the iterative updates together require $O((m + T/\tilde n)KP) = O\left((m + m\sqrt{\kappa/n})\log(1 + \kappa/n)\log(1/\epsilon)\right)$ amount of communication.

- The number of rounds of communication: Since each call of SS-SVRG needs $1 + T/\tilde n$ rounds of communication, the number of rounds of DASVRG is $O((1 + T/\tilde n)KP) = O\left((1 + \sqrt{\kappa/n})\log(1 + \kappa/n)\log(1/\epsilon)\right)$.

Recall that DSVRG needs $O((1 + \kappa/n)\log(1/\epsilon))$ rounds of communication, which is more than DASVRG. The crucial observation here is that, although in the single-machine setting the acceleration scheme of [8] and [14] only helps when $\kappa > N$, in the distributed setting it helps to reduce the rounds as long as $\kappa$ is larger than the number of local samples $n$.
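To make the comparison of rounds concrete, the sketch below evaluates the two round counts with every hidden constant set to 1. This is only an illustration under that assumption; the function names are hypothetical, and the DASVRG expression keeps the $\log(1 + \kappa/n)$ factor from the count of SS-SVRG calls.

```python
import math


def dsvrg_rounds(n, kappa, eps):
    """(1 + kappa/n) * log(1/eps), constants dropped."""
    return (1 + kappa / n) * math.log(1 / eps)


def dasvrg_rounds(n, kappa, eps):
    """(1 + sqrt(kappa/n)) * log(1 + kappa/n) * log(1/eps), constants dropped."""
    ratio = kappa / n
    return (1 + math.sqrt(ratio)) * math.log(1 + ratio) * math.log(1 / eps)


if __name__ == "__main__":
    n, eps = 1000, 1e-6
    for kappa in (1e3, 1e5, 1e7):
        print(kappa, dsvrg_rounds(n, kappa, eps), dasvrg_rounds(n, kappa, eps))
```

The gap widens as $\kappa/n$ grows, which is the regime where acceleration pays off in the distributed setting.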


5 Lower Bounds on Rounds of Communication 

In this section, we prove that, under Assumption 1, any distributed first-order method will require at least $\tilde\Omega\left(\sqrt{\kappa/n}\,\log(1/\epsilon)\right)$ rounds to find an $\epsilon$-optimal solution for (1) with both partitioned data and i.i.d. sampled data in each machine. This lower bound is matched by the upper bound of the rounds needed by DASVRG in Section 4 up to some logarithmic terms. We note that we are working under different scenarios than [2]: In [2], the authors assumed that the only property of the data that an algorithm can exploit is that the local sums are $\delta$-related for a $\delta \approx 1/\sqrt{n}$, and proved a lower bound on the number of rounds under that assumption. The DASVRG algorithm exploits the fact that data is randomly partitioned and outperforms their lower bound. This suggests that $\delta$-relatedness should not be the only property that an algorithm exploits, and motivates us to prove a new (matching) lower bound by assuming the data is randomly partitioned.

5.1 A lower bound for rounds of communication 

We first consider a family of algorithms which consist of a data distribution stage where the functions $\{f_i\}_{i\in[N]}$ are distributed onto $m$ machines, and a distributed computation stage where, in each round, machines can not only use first-order (gradient) information of the functions stored locally but also apply preconditioning using local second-order information (Hessian matrix).

Definition 1 (Distributed (extended) first-order algorithms $\mathcal{F}_\alpha$) We say an algorithm $\mathcal{A}$ for solving (1) with $m$ machines belongs to the family $\mathcal{F}_\alpha$ ($\mathcal{A} \in \mathcal{F}_\alpha$) of distributed first-order algorithms if it distributes $\{f_i\}_{i\in[N]}$ to $m$ machines only once at the beginning such that:

1. The index set $[N]$ is randomly and evenly partitioned into $S_1, S_2, \dots, S_m$ with $|S_j| = n$ for $j = 1, \dots, m$.

2. A multi-set $R_j$ of size $\alpha n$ is created by sampling with replacement from $[N]$ for $j = 1, \dots, m$, where $\alpha \ge 0$ is a constant.

3. Let $S'_j = S_j \cup R_j$. Machine $j$ acquires functions in $\{f_i \mid i \in S'_j\}$ for $j = 1, \dots, m$.

and lets the machines do the following operations in rounds:

1. Machine $j$ maintains a local set of vectors $W_j$ initialized to be $W_j = \{0\}$.

2. In each round, for arbitrarily many times, machine $j$ can add any $w$ to $W_j$ if $w$ satisfies (cf. [2])

\[
\gamma w + \nu\nabla F_j(w) \in \mathrm{span}\left\{ w', \nabla F_j(w'), (\nabla^2 F_j(w') + D)w'', (\nabla^2 F_j(w') + D)^{-1}w'' \;\middle|\; w', w'' \in W_j,\ D \text{ diagonal},\ \nabla^2 F_j(w') \text{ and } (\nabla^2 F_j(w') + D)^{-1} \text{ exist} \right\} \tag{13}
\]

for some $\gamma, \nu$ such that $\gamma\nu \ne 0$, where $F_j = \sum_{i \in U_j} f_i$ with an arbitrary subset $U_j$ of $S'_j$.

3. At the end of the round, all machines can simultaneously send any vectors in $W_j$ to any other machines, and machines can add the vectors received from other machines to their local working sets.

4. The final output is a vector in the linear span of one $W_j$.

We define $\mathcal{A}(\{f_i\}_{i\in[N]}, H)$ as the output vector of $\mathcal{A}$ when it is applied to (1) for $H$ rounds with the inputs $\{f_i\}_{i\in[N]}$.


Besides the randomness due to the data distribution stage, the algorithm $\mathcal{A}$ itself can be a randomized algorithm. Hence, the output $\mathcal{A}(\{f_i\}_{i\in[N]}, H)$ can be a random variable.

We would like to point out that, although the algorithms in $\mathcal{F}_\alpha$ can use the local second-order information like $\nabla^2 f_i(x)$ in each machine, Newton's method is still not contained in $\mathcal{F}_\alpha$ since Newton's method requires access to global second-order information such as $\nabla^2 f(x)$ (machines are not allowed to share matrices with each other). That being said, one can still use a distributed iterative method which can multiply $\nabla^2 f_i(x)$ to a local vector in order to solve the inversion of $\nabla^2 f(x)$ approximately. This leads to a distributed inexact Newton method such as DISCO [25]. In fact, both DANE [23] and DISCO [25] belong to $\mathcal{F}_\alpha$ with $\alpha = 0$. Suppose, in Assumption 1, the capacity of each machine $C$ is given such that $\tilde n = cn$. The DSVRG and DASVRG algorithms proposed in this paper belong to $\mathcal{F}_\alpha$ with $\alpha = c$.

We are ready to present the lower bounds for the rounds of communications. 

Theorem 3 Suppose $\kappa > n$ and there exists an algorithm $\mathcal{A} \in \mathcal{F}_\alpha$ with the following property:

“For any $\epsilon > 0$ and any $N$ convex functions $\{f_i\}_{i\in[N]}$ where each $f_i$ is $L$-smooth and $f$ defined in (1) is $\mu$-strongly convex, there exists $H_\epsilon$ such that the output $\hat x = \mathcal{A}(\{f_i\}_{i\in[N]}, H_\epsilon)$ satisfies $E[f(\hat x) - f(x^*)] \le \epsilon$.”

Then, when m > max{exp( ^^^^j^ (e -|- max{l, a})'^}, we must have 


He > 


yj kIu — 1 


4v^((e -|- max{l, a}) log m)^/^ 


log 


1 - 


(e + e“)(e -|- max{l, a})^ 


iT 


> 


n(log m)3 


log( 




-))• 


We want to emphasize that Theorem 3 holds without assuming Assumption 1. In the definition of $\mathcal{F}_\alpha$, we allow the algorithm to access both randomly partitioned data and independently sampled data, and allow the algorithm to use the local Hessian for preconditioning. This makes our lower bound in Theorem 3 stronger: even with an algorithm more powerful than first-order methods (in terms of the class of operations it can take) and with more options in distributing data, the number of rounds needed to find an $\epsilon$-optimal solution still cannot be reduced.

1 — 2(5 

We note that the condition At > n in Theorem [3] is necessary. Recall that, when k < " < n for 

a constant 0 < (5 < we showed in Proposition [T] in Subsection 13.31 that 0(i^q^;^^) rounds is enough 
for DSVRG. Therefore, the lower bound H > I7(y^log(i)) = C(;^ log(i)) won’t be true for the case 

1-25 

when AC < < n. 


5.2 Proof for the lower bound 

Given a vector $x \in \mathbb{R}^d$ and a set of indices $D \subset [d]$, we use $x_D$ to represent the sub-vector of $x$ that consists of the coordinates of $x$ indexed by $D$.

Definition 2 A function $f : \mathbb{R}^d \to \mathbb{R}$ is decomposable with respect to a partition $D_1, \dots, D_r$ of coordinates $[d]$ if $f$ can be written as $f(x) = g^1(x_{D_1}) + \cdots + g^r(x_{D_r})$, where $g^l : \mathbb{R}^{|D_l|} \to \mathbb{R}$ for $l = 1, \dots, r$. A set of functions $\{f_i\}_{i\in[N]}$ is simultaneously decomposable w.r.t. a partition $D_1, \dots, D_r$ if each $f_i$ is decomposable w.r.t. $D_1, \dots, D_r$.

It follows from Definition 1 and Definition 2 straightforwardly that:

Proposition 3 Suppose the functions $\{f_i\}_{i\in[N]}$ in (1) are simultaneously decomposable with respect to a partition $D_1, \dots, D_r$ so that $f_i(x) = \sum_{l=1}^{r} g_i^l(x_{D_l})$ with functions $g_i^l : \mathbb{R}^{|D_l|} \to \mathbb{R}$ for $i = 1, \dots, N$ and $l = 1, \dots, r$. We have

\[
x^*_{D_l} = \arg\min_{w \in \mathbb{R}^{|D_l|}} \left\{ \bar g^l(w) \equiv \frac{1}{N}\sum_{i=1}^{N} g_i^l(w) \right\} \quad \text{for } l = 1, 2, \dots, r, \tag{14}
\]

where $x^*$ is the optimal solution of (1).

Moreover, any algorithm $\mathcal{A} \in \mathcal{F}_\alpha$, when applied to $\{f_i\}_{i\in[N]}$, becomes decomposable with respect to the same partition $D_1, \dots, D_r$ in the following sense: For $l = 1, \dots, r$, there exists an algorithm $\mathcal{A}_l \in \mathcal{F}_\alpha$ such that, after any number of rounds $H$,

\[
E[f(\hat x) - f(x^*)] = \sum_{l=1}^{r} E\left[\bar g^l(\hat w^l) - \bar g^l(x^*_{D_l})\right],
\]

where $\hat x = \mathcal{A}(\{f_i\}_{i\in[N]}, H) \in \mathbb{R}^d$ and $\hat w^l = \mathcal{A}_l(\{g_i^l\}_{i\in[N]}, H) \in \mathbb{R}^{|D_l|}$ for $l = 1, 2, \dots, r$.

Footnote: If the capacity $C$ is such that $\tilde n = 0$ and $c = 0$, the lower bound given by Theorem 3 still applies to the algorithms in $\mathcal{F}_\alpha$ with $\alpha = 0$.

Remark 1 We note a subtlety that might be important for careful readers: here and throughout the paper, by slight abuse of terminology, we consider an “algorithm” as a sequence of operations that satisfies the requirement of Definition 1. In this sense, an algorithm doesn't have to be describable by a Turing machine and it can access any information (e.g., even the minimizer of the sum of functions) as long as the operations that it takes satisfy the rules in Definition 1. This different interpretation of “algorithm” makes Theorem 3 even stronger and Proposition 3 (which is essentially a reduction statement) true and trivial.


Proof The proof of this proposition is straightforward. Since $\{f_i\}_{i\in[N]}$ in (1) are simultaneously decomposable with respect to a partition $D_1, \dots, D_r$, we have

\[
f(x) = \frac{1}{N}\sum_{i=1}^{N}\sum_{l=1}^{r} g_i^l(x_{D_l}) = \sum_{l=1}^{r} \bar g^l(x_{D_l}),
\]

so that the problem (1) can be solved by solving (14) for each $l$ separately and $x^*_{D_l}$ must be the solution of the $l$-th problem in (14).

In addition, the function $F_j$ in Definition 1 is also decomposable with respect to the same partition $D_1, \dots, D_r$. As a result, its gradient $\nabla F_j(x)$ also has a decomposed structure in the sense that $[\nabla F_j(x)]_{D_l}$ only depends on $x_{D_l}$ for $l = 1, 2, \dots, r$. Similarly, its Hessian matrix $\nabla^2 F_j(x)$ is a block diagonal matrix with $r$ blocks and the $l$-th block only depends on $x_{D_l}$. These properties ensure that each operation as (13) conducted by $\mathcal{A}$ can be decomposed into $r$ independent operations as (13) and applied on $x_{D_1}, x_{D_2}, \dots, x_{D_r}$ separately. The data distribution and the sequence of operations conducted by $\mathcal{A}$ on $x_{D_l}$ can be viewed as an algorithm $\mathcal{A}_l \in \mathcal{F}_\alpha$ applied to $\{g_i^l\}_{i\in[N]}$ so that $\hat w^l = \mathcal{A}_l(\{g_i^l\}_{i\in[N]}, H)$ is indeed the subvector $\hat x_{D_l}$ of the vector $\hat x = \mathcal{A}(\{f_i\}_{i\in[N]}, H)$ for $l = 1, 2, \dots, r$ and any $H$. □


Now we are ready to give the proof for Theorem 3.
Proof (Theorem 3)

Let p! = pn, k' = ^ = — and k = (e-|-max{l, a}) logm. Since m > max{exp( 


- - ....X{1,Q}+1) rg_|_ 

max{l,a} ^ 

> f > I for any a > 00 


e+max{l,Q;} 


max{l,a})2}, we can show that v=^= (e+max{Pa}) logm ^ 21 og(e+max{l.a}) ^ 2 
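For the simplicity of notation, we will only prove Theorem 3 when $k$ and $v$ are both integers. The general case can be proved by a very similar argument, only with more sophisticated notation.

We first use the machinery developed by [11111113] to construct $k$ functions on $\mathbb{R}^b$ where $b = uk$ for any integer $u \ge 1$. In particular, for $i, j = 1, \dots, b$, let $\delta_{i,j}$ be a $b \times b$ matrix with its $(i,j)$ entry being one and others being zeros. Let $M_0, M_1, \dots, M_{b-1}$ be $b \times b$ matrices defined as

\[
M_i = \begin{cases}
\delta_{1,1} & \text{for } i = 0, \\
\delta_{i,i} - \delta_{i,i+1} - \delta_{i+1,i} + \delta_{i+1,i+1} & \text{for } 1 \le i \le b-2, \\
\delta_{b-1,b-1} - \delta_{b-1,b} - \delta_{b,b-1} + \frac{\sqrt{\kappa'+k-1}+3\sqrt{k}}{\sqrt{\kappa'+k-1}+\sqrt{k}}\,\delta_{b,b} & \text{for } i = b-1.
\end{cases}
\]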

Here, we use the factor that is monotonically increasing on [e,+C!o). 
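For $s \in [k]$, let $\Sigma_s = \sum_{i=0}^{u-1} M_{ik+s-1}$. For example, when $u = 2$ and $k = 3$ (so $b = 6$), the matrices $\Sigma_s$ are given as follows.

\[
\Sigma_1 = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & -1 & 0 & 0 \\
0 & 0 & -1 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}, \quad
\Sigma_2 = \begin{pmatrix}
1 & -1 & 0 & 0 & 0 & 0 \\
-1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & -1 & 0 \\
0 & 0 & 0 & -1 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix},
\]
\[
\Sigma_3 = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & -1 & 0 & 0 & 0 \\
0 & -1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & -1 \\
0 & 0 & 0 & 0 & -1 & \frac{\sqrt{\kappa'+k-1}+3\sqrt{k}}{\sqrt{\kappa'+k-1}+\sqrt{k}}
\end{pmatrix}.
\]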


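We define $k$ functions $p_1, \dots, p_k : \mathbb{R}^b \to \mathbb{R}$ as follows:

\[
p_s(w) = \begin{cases}
\dfrac{L - \mu'}{4}\left[\dfrac{w^T \Sigma_1 w}{2} - e_1^T w\right] + \dfrac{\mu'}{2}\|w\|^2 & \text{for } s = 1, \\[2ex]
\dfrac{L - \mu'}{4}\left[\dfrac{w^T \Sigma_s w}{2}\right] + \dfrac{\mu'}{2}\|w\|^2 & \text{for } s = 2, \dots, k,
\end{cases} \tag{15}
\]

where $e_1 = (1, 0, \dots, 0)^T \in \mathbb{R}^b$, and denote their average by $p = \frac{1}{k}\sum_{s=1}^{k} p_s$. Note that $L - \mu' = L(1 - 1/\kappa')$. According to the condition $\kappa > n$, we have $1 - \frac{1}{\kappa'} > 0$ so that $p_s$ for any $s \in [k]$ and $p$ are all $\mu'$-strongly convex functions. It is also easy to show that $\lambda_{\max}(\Sigma_s) \le 4$ so that $\nabla p_s$ has a Lipschitz continuity constant of $L\left(1 - \frac{1}{\kappa'}\right) + \mu' = L$ and $p_s$ is $L$-smooth.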

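Next, we characterize the optimal solution of $\min_{w \in \mathbb{R}^b} p(w)$.

Lemma 2 Let $h \in \mathbb{R}$ be the smaller root of the equation

\[
h^2 - 2\left(\frac{\kappa' - 1 + 2k}{\kappa' - 1}\right)h + 1 = 0,
\]

namely,

\[
h = \frac{\sqrt{\kappa' + k - 1} - \sqrt{k}}{\sqrt{\kappa' + k - 1} + \sqrt{k}}.
\]

Then, $w^* = (w_1^*, w_2^*, \dots, w_b^*)^T \in \mathbb{R}^b$ with

\[
w_j^* = h^j \quad \text{for } j = 1, 2, \dots, b
\]

is the optimal solution of $\min_{w\in\mathbb{R}^b} p(w)$.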

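Proof By definition, we provide the following explicit formulation of $p(w)$:

\[
p(w) = \frac{L - \mu'}{4k}\left[\frac{1}{2}w^T\left(\sum_{s=1}^{k}\Sigma_s\right)w - e_1^T w\right] + \frac{\mu'}{2}\|w\|^2. \tag{16}
\]

Observing that

\[
\sum_{s=1}^{k}\Sigma_s = \begin{pmatrix}
2 & -1 & 0 & \cdots & 0 & 0 \\
-1 & 2 & -1 & \cdots & 0 & 0 \\
0 & -1 & 2 & \cdots & 0 & 0 \\
\vdots & & \ddots & \ddots & & \vdots \\
0 & 0 & \cdots & -1 & 2 & -1 \\
0 & 0 & \cdots & 0 & -1 & \frac{\sqrt{\kappa'+k-1}+3\sqrt{k}}{\sqrt{\kappa'+k-1}+\sqrt{k}}
\end{pmatrix}
\]

and following (16), we can show that $w^*$ must satisfy the following optimality conditions:

\[
\begin{aligned}
& w_2^* - 2\left(\frac{\kappa' - 1 + 2k}{\kappa' - 1}\right)w_1^* + 1 = 0, \\
& w_{j+1}^* - 2\left(\frac{\kappa' - 1 + 2k}{\kappa' - 1}\right)w_j^* + w_{j-1}^* = 0, \quad \text{for } j = 2, 3, \dots, b-1, \\
& -\left(\frac{\sqrt{\kappa'+k-1}+3\sqrt{k}}{\sqrt{\kappa'+k-1}+\sqrt{k}} + \frac{4k}{\kappa'-1}\right)w_b^* + w_{b-1}^* = 0.
\end{aligned} \tag{17}
\]

We can easily verify that $w_j^* = h^j$ for $j = 1, 2, \dots, b$ satisfy all equations in (17), so $w^*$ is the optimal solution of $\min_{w\in\mathbb{R}^b} p(w)$. □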
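The verification that $w_j^* = h^j$ solves the optimality conditions can also be done numerically. The sketch below assumes the tridiagonal form of the conditions, with the middle coefficient $2(\kappa'-1+2k)/(\kappa'-1)$ and the modified last diagonal entry $\frac{\sqrt{\kappa'+k-1}+3\sqrt{k}}{\sqrt{\kappa'+k-1}+\sqrt{k}} + \frac{4k}{\kappa'-1}$; function names are hypothetical.

```python
import math


def smaller_root(kappa_p, k):
    """h = (sqrt(kappa' + k - 1) - sqrt(k)) / (sqrt(kappa' + k - 1) + sqrt(k))."""
    a, s = math.sqrt(kappa_p + k - 1), math.sqrt(k)
    return (a - s) / (a + s)


def optimality_residuals(b, k, kappa_p):
    """Residuals of the b optimality equations evaluated at w_j = h**j."""
    h = smaller_root(kappa_p, k)
    w = [h ** j for j in range(1, b + 1)]  # w[0] holds w_1
    t = 2 * (kappa_p - 1 + 2 * k) / (kappa_p - 1)
    a, s = math.sqrt(kappa_p + k - 1), math.sqrt(k)
    last = (a + 3 * s) / (a + s) + 4 * k / (kappa_p - 1)
    res = [w[1] - t * w[0] + 1]                                  # first equation
    res += [w[j + 1] - t * w[j] + w[j - 1] for j in range(1, b - 1)]  # middle equations
    res.append(-last * w[b - 1] + w[b - 2])                      # last equation
    return res


if __name__ == "__main__":
    print(max(abs(r) for r in optimality_residuals(b=6, k=3, kappa_p=10.0)))
```

All residuals vanish up to floating-point error, which reflects the identity $\kappa' - 1 = (\sqrt{\kappa'+k-1}-\sqrt{k})(\sqrt{\kappa'+k-1}+\sqrt{k})$ behind the choice of the last diagonal entry.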

We claim that $\{p_s\}_{s\in[k]}$ has the following property, which directly follows from our construction.

Lemma 3 Suppose $U$ is a strict subset of $\{p_s\}_{s\in[k]}$, and $q$ is an arbitrary linear combination of the $p_s$ in $U$. The Hessian of $q$ is a block diagonal matrix where each block has a size of at most $k$.

Proof Suppose $p_{s'}$ is not in $U$ for some $s' \in [k]$. Since $q$ is a linear combination of $p_s$'s in $U$, according to the construction in (15), the Hessian of $q$ is a linear combination of one diagonal matrix and all $\Sigma_s$'s except $\Sigma_{s'}$, which is a tridiagonal matrix. We note that $\Sigma_{s'}$ is the only matrix among all $\Sigma_s$'s that has non-zero entries in the positions $(s' - 1 + ik, s' + ik)$ and $(s' + ik, s' - 1 + ik)$ for $i = 0, 1, \dots, u-1$ and these positions are periodically repeated with a period of $k$. Therefore, without $\Sigma_{s'}$ involved in the linear combination, the tridiagonal Hessian becomes block diagonal with each block of a size at most $k$. □
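The block structure claimed in Lemma 3 can be checked on the small example with $u = 2$ and $k = 3$. The sketch below builds the matrices $M_i$ and $\Sigma_s$ as nested lists, assuming the definitions $M_0 = \delta_{1,1}$, $M_i = \delta_{i,i} - \delta_{i,i+1} - \delta_{i+1,i} + \delta_{i+1,i+1}$ (with the modified corner entry for $i = b-1$) and $\Sigma_s = \sum_{i=0}^{u-1} M_{ik+s-1}$; the helper names are hypothetical.

```python
import math


def build_M(b, k, kappa_p):
    """Return [M_0, ..., M_{b-1}] as b x b nested lists (text indices are 1-based)."""
    a, s = math.sqrt(kappa_p + k - 1), math.sqrt(k)
    corner = (a + 3 * s) / (a + s)
    Ms = []
    for i in range(b):
        M = [[0.0] * b for _ in range(b)]
        if i == 0:
            M[0][0] = 1.0
        else:
            r = i - 1  # 0-based position of row i
            M[r][r] = 1.0
            M[r][r + 1] = M[r + 1][r] = -1.0
            M[r + 1][r + 1] = corner if i == b - 1 else 1.0
        Ms.append(M)
    return Ms


def build_Sigma(u, k, kappa_p):
    """Return [Sigma_1, ..., Sigma_k] with Sigma_s = sum_i M_{i*k + s - 1}."""
    b = u * k
    Ms = build_M(b, k, kappa_p)
    Sigmas = []
    for s in range(1, k + 1):
        S = [[0.0] * b for _ in range(b)]
        for i in range(u):
            M = Ms[i * k + s - 1]
            for r in range(b):
                for c in range(b):
                    S[r][c] += M[r][c]
        Sigmas.append(S)
    return Sigmas


def add(A, B):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]


def max_block_size(A):
    """Largest run of consecutive indices linked by non-zero (j, j+1) entries."""
    best = run = 1
    for j in range(len(A) - 1):
        run = run + 1 if A[j][j + 1] != 0 else 1
        best = max(best, run)
    return best


if __name__ == "__main__":
    Sigmas = build_Sigma(u=2, k=3, kappa_p=10.0)
    for omit in range(3):
        rest = [S for idx, S in enumerate(Sigmas) if idx != omit]
        print(omit + 1, max_block_size(add(rest[0], rest[1])))
```

Dropping any single $\Sigma_{s'}$ leaves blocks of size at most $k = 3$, while the full sum is a single tridiagonal block of size $b = 6$.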

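To complete the proof of Theorem 3, the following lemma is critical. This lemma tells us that the property given by Lemma 3 forces the machines to perform a large number of rounds of communication in order to minimize $p$ whenever $\{p_s\}_{s\in[k]}$ do not appear together in any machine.

Lemma 4 Suppose $b$ (or $u$) is large enough. Let $\{g_i\}_{i\in[N]}$ be functions on $\mathbb{R}^b$ that consist of $v$ copies of $\{p_s\}_{s\in[k]}$ defined as (15) and $(n-1)vk$ zero functions, that is,

\[
g_i(w) = \begin{cases}
p_s(w) & \text{if } i = s, s+k, s+2k, \dots, s+(v-1)k, \\
0 & \text{if } i \ge vk+1.
\end{cases} \tag{18}
\]

Let $\bar g = \frac{1}{N}\sum_{i=1}^{N} g_i$. We have $w^* = \arg\min_{w\in\mathbb{R}^b} p(w) = \arg\min_{w\in\mathbb{R}^b} \bar g(w)$, where $w^*$ is defined as in Lemma 2.

Suppose an algorithm $\mathcal{A} \in \mathcal{F}_\alpha$ is applied to $\{g_i\}_{i\in[N]}$. Let $\mathcal{E}$ be the random event that none of the $m$ machines has all functions in $\{p_s\}_{s\in[k]}$ (in either $S_j$ or $R_j$) after the data distribution stage of $\mathcal{A}$ and let $\hat w = \mathcal{A}(\{g_i\}_{i\in[N]}, H)$. Then, to ensure $E[\bar g(\hat w) - \bar g(w^*) \mid \mathcal{E}] \le \epsilon$, we need

\[
H \ge \left(\frac{\sqrt{\kappa'+k-1} - \sqrt{k}}{4k\sqrt{k}}\right)\log\left(\frac{\mu'\|w^*\|^2}{4n\epsilon}\right).
\]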

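Proof Since $\bar g = \frac{1}{N}\sum_{i=1}^{N} g_i = \frac{v}{N}\sum_{s=1}^{k} p_s = \frac{p}{n}$, we have $w^* = \arg\min_{w\in\mathbb{R}^b} \bar g(w) = \arg\min_{w\in\mathbb{R}^b} p(w)$ by Lemma 2, where $w^*$ is defined as in Lemma 2.

Let $E_0 = \{0\}$ and $E_t$ be the linear space spanned by the unit vectors $e_1, \dots, e_t$ for $t = 1, \dots, b$. Suppose event $\mathcal{E}$ happens. Every machine will only have a strict subset $U$ of $\{p_s\}_{s\in[k]}$. Lemma 3 guarantees that, under algorithm $\mathcal{A}$, if machine $j$ starts one round with a set of working vectors $W_j \subset E_t$, then $W_j$ is always contained by the space $E_{t+k}$ after this round. Therefore, we can show that, at the beginning of round $\ell$ in algorithm $\mathcal{A}$, if $\cup_j W_j \subset E_t$, then at the end of round $\ell$ (and at the beginning of round $\ell+1$), we have $\cup_j W_j \subset E_{t+k}$. Using this finding and the fact that $\cup_j W_j = \{0\} = E_0$ initially, we conclude that, after $H$ rounds in $\mathcal{A}$, $\cup_j W_j \subset E_{Hk}$. Let $t = Hk$. Since $\hat w = \mathcal{A}(\{g_i\}_{i\in[N]}, H)$, we must have $\hat w \in E_t$. By Lemma 2, we can show that

\[
\|w^*\|^2 = \sum_{j=1}^{b} h^{2j} = \frac{h^2(1 - h^{2b})}{1 - h^2}. \tag{19}
\]

Following the standard analysis for this type of construction and using the $\mu'$-strong convexity of $p$, we have

\[
E[p(\hat w) - p(w^*) \mid \mathcal{E}] \ge \frac{\mu'}{2}E[\|\hat w - w^*\|^2 \mid \mathcal{E}] \ge \frac{\mu'}{2}\sum_{j=t+1}^{b} h^{2j} = \frac{\mu'\|w^*\|^2}{2}\,h^{2t}\,\frac{1 - h^{2(b-t)}}{1 - h^{2b}},
\]

where the second inequality is because $\hat w \in E_t$ and the last equality is due to (19). When $b$ (or $u$) is large enough, the inequality above implies

\[
E[p(\hat w) - p(w^*) \mid \mathcal{E}] \ge \frac{\mu'\|w^*\|^2}{4}h^{2t} = \frac{\mu'\|w^*\|^2}{4}\left(\frac{\sqrt{\kappa'+k-1} - \sqrt{k}}{\sqrt{\kappa'+k-1} + \sqrt{k}}\right)^{2t}.
\]

Based on this inequality, when $E[\bar g(\hat w) - \bar g(w^*) \mid \mathcal{E}] \le \epsilon$, or equivalently, when $E[p(\hat w) - p(w^*) \mid \mathcal{E}] \le n\epsilon$, we must have

\[
\log\left(\frac{\mu'\|w^*\|^2}{4n\epsilon}\right) \le 2t\log\left(\frac{\sqrt{\kappa'+k-1} + \sqrt{k}}{\sqrt{\kappa'+k-1} - \sqrt{k}}\right) = 2t\log\left(1 + \frac{2\sqrt{k}}{\sqrt{\kappa'+k-1} - \sqrt{k}}\right) \le \frac{4t\sqrt{k}}{\sqrt{\kappa'+k-1} - \sqrt{k}},
\]

which, with $t = Hk$, further implies

\[
H \ge \left(\frac{\sqrt{\kappa'+k-1} - \sqrt{k}}{4k\sqrt{k}}\right)\log\left(\frac{\mu'\|w^*\|^2}{4n\epsilon}\right).
\]

□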


We now complete the proof of Theorem 3 by constructing $N$ special functions $\{f_i\}_{i\in[N]}$ on $\mathbb{R}^d$ with $d = nb$ for a sufficiently large $b$ (or $u$) based on $\{p_s\}_{s\in[k]}$, so that any algorithm $\mathcal{A} \in \mathcal{F}_\alpha$, when applied to $\{f_i\}_{i\in[N]}$, will need at least the targeted number of rounds of communication.

We partition the set of indices $[d] = \{1, \dots, d\}$ into $n$ disjoint subsets $D_1, D_2, \dots, D_n$ with $|D_j| = b$ and $D_j = \{b(j-1)+1, \dots, bj\}$. For any $j \in [n]$ and $s \in [k]$, let $q_{j,s}(x)$ be a function on $\mathbb{R}^d$ such that $q_{j,s}(x) = p_s(x_{D_j})$, which means $q_{j,s}(x)$ only depends on the $b$ coordinates of $x$ indexed by $D_j$. Therefore, we obtain $nk$ different functions $\{q_{j,s}\}_{j\in[n], s\in[k]}$. Finally, we define $\{f_i\}_{i\in[N]}$ to be a set that consists of $v$ copies of $\{q_{j,s}\}_{j\in[n], s\in[k]}$ (recall that $N = vkn$ and $v \ge 1$ is an integer). Because

\[
f(x) = \frac{1}{N}\sum_{i=1}^{N} f_i(x) = \frac{v}{N}\sum_{j=1}^{n}\sum_{s=1}^{k} q_{j,s}(x) = \frac{1}{nk}\sum_{j=1}^{n}\sum_{s=1}^{k} p_s(x_{D_j}) = \frac{1}{n}\sum_{j=1}^{n} p(x_{D_j}) \tag{20}
\]

and Lemma 2, the optimal solution $x^*$ for (1) with $\{f_i\}_{i\in[N]}$ constructed as above is $x^* = (w^{*T}, w^{*T}, \dots, w^{*T})^T$, where $w^* \in \mathbb{R}^b$ is defined as in Lemma 2 and is repeated $n$ times.

Now, we want to verify that the functions $\{f_i\}_{i\in[N]}$ satisfy our assumptions. In fact, we have shown that $p_s$ is $L$-smooth for each $s \in [k]$. Since $f_i$ equals $p_s(x_{D_j})$ for some $j \in [n]$ and $s \in [k]$, the function $f_i$ is $L$-smooth for each $i \in [N]$ as well. Since $p$ is $\mu'$-strongly convex (on $\mathbb{R}^b$) and $\mu' = n\mu$, the function $f$ defined in (1) must be $\mu$-strongly convex (on $\mathbb{R}^d$) according to the relationship (20).

According to its construction, $\{f_i\}_{i\in[N]}$ are simultaneously decomposable with respect to a partition $D_1, \dots, D_n$ with $D_j = \{b(j-1)+1, \dots, bj\}$ (see Definition 2). In particular, for any $i \in [N]$, $f_i(x) = \sum_{l=1}^{n} g_i^l(x_{D_l})$ where $g_i^l \in \{p_s\}_{s\in[k]}$ for exactly one $l \in [n]$ and $g_i^l \equiv 0$ for the other $l$'s. Moreover, for any $l \in [n]$, $\{g_i^l\}_{i\in[N]} = \{g_i\}_{i\in[N]}$ where $\{g_i\}_{i\in[N]}$ are defined as (18), such that $\bar g^l = \frac{1}{N}\sum_{i=1}^{N} g_i^l = \frac{1}{N}\sum_{i=1}^{N} g_i = \bar g$.

By Proposition 3, $\mathcal{A}$ can be decomposed with respect to the same partition $D_1, D_2, \dots, D_n$ into $\mathcal{A}_1, \dots, \mathcal{A}_n \in \mathcal{F}_\alpha$ and $\mathcal{A}_l$ is applied to $\{g_i\}_{i\in[N]}$. Following Definition 1, let $S_1, \dots, S_m$ be the random partition of $[N]$ and $R_1, \dots, R_m$ be sets of i.i.d. indices uniformly drawn from $[N]$ with $|R_j| = \alpha n$. Let $S'_j = S_j \cup R_j$. Then, the algorithm $\mathcal{A}_l$ will allocate $\{g_i \mid i \in S'_j\}$ to machine $j$ and start the computation in rounds.

We now focus on the solution generated by $\mathcal{A}_l$ for any $l$. For machine $j$ in $\mathcal{A}_l$, let $Y_{1,j}$ be the number of functions in $\{p_s\}_{s\in[k]}$ (repetitions counted) that are contained in $S_j$ and $Y_{2,j}$ be the number of functions in $\{p_s\}_{s\in[k]}$ (repetitions counted) that are contained in $R_j$.
Due to (18), the function $g_i$ is not a zero function if and only if $1 \le i \le vk = m$. Hence, $Y_{1,j}$ has a hypergeometric distribution where $\mathrm{Prob}(Y_{1,j} = r)$ equals the probability of $r$ successes in $n$ draws, without replacement, from a population of size $N$ that contains exactly $m$ successes. According to Chvátal [4], we have

\[
\mathrm{Prob}(Y_{1,j} \ge r) \le \left(\frac{1}{r}\right)^{r}\left(\frac{n-1}{n-r}\right)^{n-r},
\]

which, when $r = e\log m$, implies

\[
\begin{aligned}
\mathrm{Prob}(Y_{1,j} \ge e\log m)
&\le \left(\frac{1}{e\log m}\right)^{e\log m}\left(\frac{n-1}{n-e\log m}\right)^{n-e\log m}
= \left(\frac{1}{e\log m}\right)^{e\log m}\left(1 + \frac{e\log m - 1}{n - e\log m}\right)^{n-e\log m} \\
&\le \left(\frac{1}{e\log m}\right)^{e\log m} e^{e\log m - 1}
= \frac{1}{e}\left(\frac{1}{\log m}\right)^{e\log m}
\le \frac{1}{e}\left(\frac{1}{2\log(e+1)}\right)^{e\log m}
\le \frac{1}{em^2},
\end{aligned} \tag{21}
\]

where the second inequality is because $(1 + \frac{1}{x})^{x} \le e$ for any $x > 0$, the third inequality is due to the assumption on $m$ in Theorem 3, which implies $m \ge (e + \max\{1,\alpha\})^2 \ge (e+1)^2$, and the last inequality is because $(2\log(e+1))^{e} \ge e^2$.
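The tail bound used for $Y_{1,j}$ can be compared against the exact hypergeometric tail on a small instance. The sketch below assumes the bound takes the form $(1/r)^r\left(\frac{n-1}{n-r}\right)^{n-r}$ for $n$ draws from a population of size $N = mn$ with $m$ successes; the function names are hypothetical and only the standard library is used.

```python
import math
from math import comb


def hypergeom_tail(N, M, n, r):
    """Exact P(Y >= r) for n draws without replacement, M successes among N items."""
    total = comb(N, n)
    upper = min(M, n)
    return sum(comb(M, y) * comb(N - M, n - y) for y in range(r, upper + 1)) / total


def tail_bound(n, r):
    """(1/r)^r * ((n-1)/(n-r))^(n-r), valid for 1 <= r < n when the mean is 1."""
    return (1.0 / r) ** r * ((n - 1) / (n - r)) ** (n - r)


if __name__ == "__main__":
    m, n = 20, 50  # m machines, n local samples, so N = m * n
    for r in range(2, 7):
        print(r, hypergeom_tail(m * n, m, n, r), tail_bound(n, r))
    # the numerical fact used in the last step of the chain of inequalities
    print((2 * math.log(math.e + 1)) ** math.e, math.e ** 2)
```

The exact tail stays below the bound for every $r$ tried, and the final print confirms $(2\log(e+1))^{e} \ge e^{2}$.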

On the other hand, we can represent $Y_{2,j} = \sum_{r\in R_j} 1_{r\le vk}$, which is the sum of $\alpha n$ i.i.d. binary random variables $1_{r\le vk}$'s which equal one with a probability of $\frac{vk}{N} = \frac{1}{n}$ and zero with a probability of $1 - \frac{1}{n}$. By the Chernoff inequality of multiplicative form, we have


Prof(H 2 ,j > max{l,a}logm) < 


/ 


maxil.ad 1 t 

3 -log m —1 


max{ 1 ,Q;} 


( max| 1,Q:| 1 \ 

-^^logmj 

gmax{l,Q:} log m—a. 


log m 


( max|l.al ^ \ 

-log m J 


1 


e“ \max{l, a} logm 


max{l,a} log m 

max{l,a} log m 


^ 1 / 1 

— pa I -2—^ 

C ygmax{l,Q;} 

1 

e“m 2 ’ 


max{l,Q!} log m 


( 22 ) 


^ _i_ 1 

where the second inequality is because of the assumption that m > max{exp( ^^^“^ ^^ gmax{i,a} (g _|_ 
max{l,a}) 2 } > 

Combining m and (I22p for j = 1, 2,..., m and using the union bound, we have 


Prob(Yi > elogm for some j or Y 2 > max{l,a}logm for some j) < - 1 -= 7 -r—, 

em e“m (e + e“)m 
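which implies

\[
\mathrm{Prob}\left(Y_{1,j} + Y_{2,j} < (e + \max\{1,\alpha\})\log m \ \text{ for } j = 1, 2, \dots, m\right) \ge 1 - \frac{1}{(e + e^{\alpha})m}.
\]

Therefore, we have shown that, with a probability of at least $1 - \frac{1}{(e+e^{\alpha})m}$, all of the sets $S'_1, \dots, S'_m$ contain fewer than $(e + \max\{1,\alpha\})\log m = k$ functions from $\{p_s\}_{s\in[k]}$ (repetitions counted). In other words, with a probability of at least $1 - \frac{1}{(e+e^{\alpha})m}$, none of the sets $S'_1, \dots, S'_m$ contains all of the functions in $\{p_s\}_{s\in[k]}$. If the event that “none of the sets $S'_1, \dots, S'_m$ contains all of the functions of $\{p_s\}_{s\in[k]}$” (the same as the event $\mathcal{E}$ in Lemma 4) indeed happens in $\mathcal{A}_l$, we call $\mathcal{A}_l$ bad. Then, we have actually proved

\[
\mathrm{Prob}(\mathcal{A}_l \text{ is bad}) \ge \left(1 - \frac{1}{(e+e^{\alpha})m}\right) \ge \left(1 - \frac{1}{(e+e^{\alpha})(e+\max\{1,\alpha\})^2}\right).
\]

By Proposition 3 and $\{g_i^l\}_{i\in[N]} = \{g_i\}_{i\in[N]}$, after $H$ rounds, the solutions $\hat x = \mathcal{A}(\{f_i\}_{i\in[N]}, H) \in \mathbb{R}^d$ and $\hat w^l = \mathcal{A}_l(\{g_i^l\}_{i\in[N]}, H) = \mathcal{A}_l(\{g_i\}_{i\in[N]}, H) \in \mathbb{R}^b$ for $l = 1, 2, \dots, n$ satisfy

\[
\begin{aligned}
E[f(\hat x) - f(x^*)]
&= \sum_{l=1}^{n} E[\bar g^l(\hat w^l) - \bar g^l(x^*_{D_l})]
= \sum_{l=1}^{n} E[\bar g(\hat w^l) - \bar g(x^*_{D_l})] \\
&\ge \sum_{l=1}^{n} E[\bar g(\hat w^l) - \bar g(x^*_{D_l}) \mid \mathcal{A}_l \text{ is bad}]\,\mathrm{Prob}(\mathcal{A}_l \text{ is bad}) \\
&\ge \sum_{l=1}^{n} E[\bar g(\hat w^l) - \bar g(x^*_{D_l}) \mid \mathcal{A}_l \text{ is bad}]\left(1 - \frac{1}{(e+e^{\alpha})(e+\max\{1,\alpha\})^2}\right).
\end{aligned}
\]

Therefore, if $E[f(\hat x) - f(x^*)] \le \epsilon$, there must exist an $l \in [n]$ such that

\[
E[\bar g(\hat w^l) - \bar g(x^*_{D_l}) \mid \mathcal{A}_l \text{ is bad}]\left(1 - \frac{1}{(e+e^{\alpha})(e+\max\{1,\alpha\})^2}\right) \le \frac{\epsilon}{n}. \tag{23}
\]

When $\mathcal{A}_l$ is bad, after the data distribution stage, none of the $m$ machines in $\mathcal{A}_l$ has all functions in $\{p_s\}_{s\in[k]}$. According to Lemma 4, we know that to ensure (23), $\mathcal{A}_l$ needs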




> 


y Aky/k 

(log 

UV2kVk) ® 


1 


gn]]w*]\'‘ 


1 - 


(e + e“)(e + max{l, a})^y 4e 
1 \ fm]]w*]]'^' 


(e + e“)(e + max{l,a})2y 4e 
which is the desired lower bound after plugging in $k = (e + \max\{1,\alpha\})\log m$.


□ 


6 Numerical Experiments 

In this section, we conduct numerical experiments to compare our DSVRG and DASVRG algorithms
with DisDCA (with its practical updates) and a distributed implementation of the accelerated
gradient method (Accel Grad) by Nesterov [17]. We apply these four algorithms to the ERM problem
(3) with three datasets¹: Covtype, Million Song and Epsilon. According to the types of data, the loss
function φ(x, ξ) in (3) is chosen to be the square loss in ridge regression for Million Song and the logistic
loss in logistic regression for the other two datasets.

¹ http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html

Fig. 1 Comparing the DSVRG and DASVRG methods with DisDCA and the accelerated gradient method (Accel Grad)
in rounds.

Following the previous work, we map the target variable of year from 1922 to 2011 into [0,1] for the
Million Song data. The original Covtype and Million Song datasets are not very large (Covtype is 62M
and Million Song is 450M in the original size), and both our algorithms and DisDCA finish quickly on our
server. Therefore, to compare these algorithms in a more challenging setting, we conduct
experiments using random Fourier features (RFF) [18] on the Covtype and Million Song datasets. RFF
is a popular technique for scaling up kernel methods: it generates finite-dimensional features
whose inner products approximate the kernel similarity. We generate the RFF corresponding to the RBF
kernel. After this transformation, the Covtype data has N = 522,911 examples and d = 1,000 features, and
the Million Song data has N = 463,715 examples and d = 2,000 features. Since the original Epsilon data
is large enough (12G), we use its original features.
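The RFF construction described above can be illustrated with a minimal sketch. It assumes the RBF kernel is parametrized as k(x, y) = exp(-gamma * ||x - y||^2); the names `make_rff_map`, `rbf_kernel`, `gamma` and `num_features` are hypothetical, and the code uses plain Python for readability. It is not the feature-generation code used in the experiments.

```python
import math
import random


def make_rff_map(input_dim, num_features, gamma, seed=0):
    """Return z(.) such that z(x) . z(y) approximates exp(-gamma * ||x - y||^2)."""
    rng = random.Random(seed)
    # Frequencies are sampled from the kernel's spectral density N(0, 2 * gamma * I).
    std = math.sqrt(2.0 * gamma)
    weights = [[rng.gauss(0.0, std) for _ in range(input_dim)]
               for _ in range(num_features)]
    # Random phases are uniform on [0, 2 * pi].
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def feature_map(x):
        return [scale * math.cos(sum(w_k * x_k for w_k, x_k in zip(w, x)) + b)
                for w, b in zip(weights, phases)]

    return feature_map


def rbf_kernel(x, y, gamma):
    """Exact RBF kernel value, used only to check the approximation."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))
```

With a few thousand features, `dot(z(x), z(y))` is typically close to `rbf_kernel(x, y, gamma)`, so a linear model trained on the mapped data behaves like an approximate kernel method while remaining a finite-dimensional ERM problem.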

The experiments are conducted on one server (Intel(R) Xeon(R) CPU E5-2667 v2 3.30GHz) with 
multiple processes with each process simulating one machine. We first choose the number of processes 
(machines) to be m = 5. To test the performances of algorithms for different condition numbers, we 
choose the value of the regularization parameter λ in (3) to be 1/N^{0.5}, 1/N^{0.75}, and 1/N. For each

setting, L is computed as ll°dl — 1 _ ^ -v^here ^ is the Lipschiz continuous constant of Va;0(x, ^), 

and is equal to A. We implement DSVRG by choosing rj = T — 10, 000 and K = y - For DASVRG, 
we choose ry = ■^, T = 10, 000, K = 1 and P = ^. In both DSVRG and DASVRG, we directly choose 
R_j = S_j (so that |R_j| = n and Q = N) since it saves the time for data allocation and, in practice,
gives performance very similar to that obtained when R_j is sampled separately. For DisDCA, we use
SDCA as the local solver so that it is equivalent to the implementation of CoCoA+ with σ' = m and
γ = 1 as in the experiments in [15]. We run SDCA for T = 10,000 iterations in each round of DisDCA
with Y rounds in total. 
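To make the link between the regularization parameter and the condition number concrete, the following sketch computes an upper bound on the smoothness constant and the resulting ratio L/μ. It rests on stated assumptions rather than on the exact formula used in the experiments: each component is taken to be f_i(x) = φ(x, ξ_i) + (λ/2)||x||^2, the square loss is 0.5 (a^T x - b)^2, and μ is taken to be λ. The function names are hypothetical.

```python
def smoothness_constant(features, lam, loss="square"):
    """Upper bound on the gradient Lipschitz constant of every f_i.

    Assumes f_i(x) = phi(x, xi_i) + (lam / 2) * ||x||^2.
    """
    # The curvature of 0.5 * (a^T x - b)^2 is ||a||^2; for the logistic loss
    # it is at most ||a||^2 / 4.
    factor = 1.0 if loss == "square" else 0.25
    max_sq_norm = max(sum(v * v for v in a) for a in features)
    return factor * max_sq_norm + lam


def condition_numbers(features, lambdas, loss="square"):
    """Return L / mu for each lam, taking mu = lam (an assumption)."""
    return [smoothness_constant(features, lam, loss) / lam for lam in lambdas]
```

For feature vectors of roughly unit norm, the ratio behaves like 1/λ, so shrinking λ from 0.1 to 0.01 raises it from about 11 to about 101 for the square loss. This matches the qualitative statement that a smaller regularization parameter gives a harder, more ill-conditioned problem.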

The numerical results are presented in Figure 1 and Figure 2. The horizontal axis represents the number
of rounds of communication conducted by the algorithms in Figure 1 and the parallel runtime (in
seconds) used by the algorithms in Figure 2. In both figures, the vertical axis represents the logarithm of
the optimality gap. According to Figure 1 and Figure 2, the performance of all algorithms worsens as λ
decreases (so the condition number increases). We find that DSVRG and DASVRG have almost identical
performance in rounds of communication, and they both outperform the other two methods significantly.
This shows the merit of our methods when applied to computer clusters with a high communication cost
due to significant network delay. DSVRG and DASVRG have slightly different performance in runtime;
they outperform the other two methods on the Million Song data and obtain comparable performance
on the Covtype data. DSVRG and DASVRG do not perform as well as DisDCA in runtime on the Epsilon data.

Fig. 2 Comparing the DSVRG and DASVRG methods with DisDCA and the accelerated gradient method (Accel Grad)
in runtime.

To compare the performance of the algorithms under different values of m, we choose m = 10 and
15 and repeat the same experiments on the Epsilon data. The numerical results based on the rounds and
runtime are shown in Figure 3 and Figure 4, respectively. Similar to the case of m = 5, our DSVRG and
DASVRG require fewer rounds to reach the same ε-optimal solution but might require longer runtime
on some datasets.


7 Conclusion 

We propose a DSVRG algorithm for minimizing the average of N convex functions that are stored on m
machines. Our algorithm is a distributed extension of the existing SVRG algorithm, in which we compute
the batch gradients in parallel while letting the machines perform iterative updates in serial. Assuming sufficient
memory in each machine, we develop an efficient data allocation scheme that stores extra functions in each
machine in order to construct the unbiased stochastic gradient in each iterative update. We provide a theoretical
analysis of the parallel runtime, the amount of communication and the rounds of communication needed by DSVRG to find
an ε-optimal solution, showing that it is optimal under all three metrics in some practical
scenarios. Moreover, we propose a DASVRG algorithm that requires even fewer rounds of communication
than DSVRG and almost all existing distributed algorithms, using an acceleration strategy by [8] and [14].
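The division of labor summarized above (batch gradients computed in parallel, iterative updates performed in serial) can be sketched as a single-process simulation. This is a simplified illustration under stated assumptions, not the algorithm as analyzed in the paper: the objective is ridge regression, the active machine is chosen round-robin, and it samples from its own shard (mirroring the choice R_j = S_j used in the experiments) instead of using the data allocation scheme. The names `dsvrg_sketch`, `shards`, `step`, `inner_steps` and `rounds` are hypothetical.

```python
import random


def grad_i(x, example, lam):
    """Gradient of f_i(x) = 0.5 * (a^T x - b)^2 + (lam / 2) * ||x||^2."""
    a, b = example
    residual = sum(a_k * x_k for a_k, x_k in zip(a, x)) - b
    return [residual * a_k + lam * x_k for a_k, x_k in zip(a, x)]


def full_gradient(x, shards, lam):
    """Average gradient over all shards; each shard's sum could run in parallel."""
    total = sum(len(shard) for shard in shards)
    acc = [0.0] * len(x)
    for shard in shards:  # one iteration per simulated machine
        for example in shard:
            acc = [u + v for u, v in zip(acc, grad_i(x, example, lam))]
    return [v / total for v in acc]


def objective(x, shards, lam):
    examples = [ex for shard in shards for ex in shard]
    loss = sum(0.5 * (sum(a_k * x_k for a_k, x_k in zip(a, x)) - b) ** 2
               for a, b in examples)
    return loss / len(examples) + 0.5 * lam * sum(v * v for v in x)


def dsvrg_sketch(shards, lam, step, inner_steps, rounds, seed=0):
    rng = random.Random(seed)
    x_ref = [0.0] * len(shards[0][0][0])
    for r in range(rounds):
        # Parallel phase: batch gradient at the reference point.
        batch_grad = full_gradient(x_ref, shards, lam)
        # Serial phase: one machine runs the variance-reduced inner loop.
        shard = shards[r % len(shards)]
        x = list(x_ref)
        for _ in range(inner_steps):
            example = rng.choice(shard)
            g_new = grad_i(x, example, lam)
            g_old = grad_i(x_ref, example, lam)
            x = [x_k - step * (gn - go + bg)
                 for x_k, gn, go, bg in zip(x, g_new, g_old, batch_grad)]
        x_ref = x  # the active machine passes its iterate on
    return x_ref
```

In this simulation, each round corresponds to one exchange of the batch gradient and one hand-off of the iterate, which is why the number of rounds, rather than the number of inner steps, is the quantity that matters for communication.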


Fig. 3 Comparing the DSVRG and DASVRG methods with DisDCA and the accelerated gradient method (Accel Grad)
in rounds.

Fig. 4 Comparing the DSVRG and DASVRG methods with DisDCA and the accelerated gradient method (Accel Grad)
in runtime.